=Paper=
{{Paper
|id=Vol-2657/xproceedings
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-2657/xproceedings.pdf
|volume=Vol-2657
}}
==None==
Proceedings of the ACM SIGKDD Workshop on Knowledge-infused Mining and Learning for Social Impact (KiML 2020)

First International Workshop on Advancing Decision Making in Health, Crisis Response, and Finance

Editors: Manas Gaur, Alejandro Jaimes, Fatma Özcan, Srinivasan Parthasarathy, Sameena Shah, Amit Sheth, and Biplav Srivastava

August 24, 2020, San Diego, CA. Co-located with the 26th ACM Conference on Knowledge Discovery and Data Mining (KDD 2020). http://kiml2020.aiisc.ai/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee. These proceedings are not included in the ACM Digital Library. KiML'20, August 24, 2020, San Diego, California, USA. Copyright (c) 2020 held by the author(s). In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the ACM SIGKDD 2020 Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Organizers:
Manas Gaur (AI Institute, University of South Carolina)
Alejandro (Alex) Jaimes (Dataminr Inc., NYC)
Fatma Özcan (IBM Research Almaden)
Srinivasan Parthasarathy (Ohio State University)
Sameena Shah (JP Morgan, NYC)
Amit Sheth (AI Institute, University of South Carolina)
Biplav Srivastava (IBM Chief Analytics Office, NYC)

Program Committee:
Nitin Agarwal (University of Arkansas)
Amanuel Alambo (Kno.e.sis Center)
Shreyansh Bhatt (Amazon)
Vasilis Efthymiou (IBM Research)
Utkarshani Jaimini (AI Institute, University of South Carolina)
Ugur Kurşuncu (AI Institute, University of South Carolina)
Sarasi Lalithsena (IBM Watson)
Chuan Lei (IBM Research)
Quanzhi Li (Alibaba Group)
Xiaomo Liu (S&P Global Ratings)
Yong Liu (Outreach.io)
Raghava Mutharaju (IIIT Delhi)
Arindam Pal (Data61, CSIRO)
Sujan Perera (Amazon)
Hemant Purohit (George Mason University)
Kaushik Roy (AI Institute, University of South Carolina)
Valerie Shalin (Wright State University)
Kai Shu (Arizona State University)
Nikhita Vedula (Ohio State University)
Ruwan Wickramarachchi (AI Institute, University of South Carolina)
Ke Zhang (Dataminr Inc.)
Jinjin Zhao (Amazon)

Webmasters:
Vishal Pallagani (AI Institute, University of South Carolina)
Ibrahim Salman (AI Institute, University of South Carolina)

Preface

Research in artificial intelligence and data science is accelerating rapidly due to an unprecedented explosion in the amount of information on the web. In parallel, we have seen immense growth in the construction and use of knowledge networks at Google, Netflix, NSF, and NIH. However, current methods risk an unsatisfactory ceiling of applicability due to shortcomings in unifying knowledge graphs, data mining, and deep learning.
In this changing world, retrospective studies of state-of-the-art AI and data science systems have raised concerns about trust, traceability, and interactivity for prospective applications in healthcare, finance, and crisis response. We believe the paradigm of knowledge-infused mining and learning can account both for knowledge that accrues from domain expertise and for guidance from physical models. Further, it will allow the community to design new evaluation strategies that assess robustness and fairness across all comparable state-of-the-art algorithms.

The Workshop on Knowledge-infused Mining and Learning for Social Impact was centered on the following thematic components:
(a) Data Management: resource management and resource discovery across heterogeneous and inconsistent data resources.
(b) Data Usage: methods and systems for visualization, representation, reasoning, and interaction.
(c) Evaluation: bringing together researchers at the intersection of databases, the semantic web, information systems, and AI to create new approaches and tools that benefit a broad range of policymakers (e.g., mental health professionals, education practitioners, emergency responders, and economists).

The workshop brought together researchers and practitioners from both academia and industry who are interested in the creation and use of knowledge graphs for understanding online conversations in crisis response (e.g., COVID-19), public health (e.g., social network analysis for mental health insights), and finance (e.g., mining insights on the financial impact of COVID-19, such as recession and unemployment, using Twitter or organizational data). Additionally, we encouraged researchers and practitioners from the areas of human-centered computing, interaction and reasoning, statistical relational mining and learning, intelligent agent systems, semantic social network analysis, deep graph learning, and recommender systems.
The main program of KiML'20 consists of seven papers, selected out of thirteen submissions, covering topics related to knowledge-enabled feature elicitation, adversarial learning, crisis response, public health, and COVID-19. We sincerely thank the authors of the submissions as well as the attendees of the workshop. We wish to thank the members of our program committee for their help in selecting high-quality papers. Furthermore, we are grateful to Manuela Veloso, Sriraam Natarajan, Jose Ambite, and Pieter De Leenheer for giving keynote presentations on their recent work on Symbiotic Autonomy, Human Allied Probabilistic Learning, Biomedical Data Science, and Data Intelligence.

Manas Gaur, Alejandro Jaimes, Fatma Özcan, Srinivasan Parthasarathy, Sameena Shah, Amit Sheth, and Biplav Srivastava
August 2020

Table of Contents

Invited Talks
• Symbiotic Autonomy: Knowing When and What to Learn from Experience. Manuela M. Veloso
• Human Allied Probabilistic Learning. Sriraam Natarajan
• Data Intelligence in the 2020s. Pieter De Leenheer
• Semantics in Biomedical Data Science. Jose Luis Ambite

Research Papers
• Textual Evidence for the Perfunctoriness of Independent Medical Reviews. Adrian Brasoveanu, Megan Moodie and Rakshit Agrawal
• Knowledge Intensive Learning of Generative Adversarial Networks. Devendra Dhami, Mayukh Das and Sriraam Natarajan
• Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak. Amanuel Alambo, Manas Gaur and Krishnaprasad Thirunarayan
• Cost Aware Feature Elicitation. Srijita Das, Rishabh Iyer and Sriraam Natarajan
• A New Delay Differential Equation Model for COVID-19. B Shayak, Mohit Manoj Sharma and Manas Gaur
• Public Health Implications of a Delay Differential Equation Model for COVID-19. Mohit Manoj Sharma and B Shayak
• Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. Jitin Krishnan, Hemant Purohit and Huzefa Rangwala

Keynote Talk 1
Symbiotic Autonomy: Knowing When and What to Learn from Experience
Manuela M. Veloso
Head, JPMorgan AI Research
Herbert A. Simon University Professor, School of Computer Science, Carnegie Mellon University
manuela.veloso@jpmchase.com

Abstract: The talk will present work on novel human-AI interaction, in which humans and AI complement each other in their knowledge and learning. I will discuss examples in autonomous mobile service robots and in the financial domain. I will conclude with a brief discussion of multiple forms of available knowledge for AI systems that continuously learn from experience.

Bio: Manuela M. Veloso is the Head of J.P. Morgan AI Research, which pursues fundamental research in areas of core relevance to financial services, including data mining and cryptography, machine learning, explainability, and human-AI interaction. J.P. Morgan AI Research partners with applied data analytics teams across the firm as well as with leading academic institutions globally. Professor Veloso is on leave from Carnegie Mellon University as the Herbert A. Simon University Professor in the School of Computer Science, and the past Head of the Machine Learning Department.
With her students, she has led research in AI, with a focus on robotics and machine learning, having concretely researched and developed a variety of autonomous robots, including teams of soccer robots and mobile service robots. Her robot soccer teams have been RoboCup world champions several times, and the CoBot mobile robots have autonomously navigated for more than 1,000 km in university buildings. Professor Veloso is the Past President of AAAI (the Association for the Advancement of Artificial Intelligence), and the co-founder, Trustee, and Past President of RoboCup. Professor Veloso has been recognized with multiple honors, including being a Fellow of the ACM, IEEE, AAAS, and AAAI. She is the recipient of several best paper awards, the Einstein Chair of the Chinese Academy of Science, the ACM/SIGART Autonomous Agents Research Award, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Professor Veloso earned Bachelor and Master of Science degrees in Electrical and Computer Engineering from Instituto Superior Tecnico in Lisbon, Portugal, a Master of Arts in Computer Science from Boston University, and Master of Science and Ph.D. degrees in Computer Science from Carnegie Mellon University. See www.cs.cmu.edu/~mmv/Veloso.html for her scientific publications.

Keynote Talk 2
Human Allied Probabilistic Learning
Sriraam Natarajan
Director, Center for Machine Learning
Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas
sriraam.natarajan@utdallas.edu

Abstract: Historically, Artificial Intelligence has taken a symbolic route for representing and reasoning about objects at a higher level, or a statistical route for learning complex models from large data. To achieve true AI, it is necessary to make these different paths meet and enable seamless human interaction. First, I will briefly introduce learning from rich, structured, complex, and noisy data.
Next, I will present the recent progress that allows for more reasonable human interaction, where the human input is taken as "advice" and the learning algorithm combines this advice with data. The advice can be in the form of qualitative influences, preferences over labels/actions, privileged information obtained during training, or a simple precision-recall trade-off. Finally, I will outline our recent work on "closing the loop," where information is solicited from humans as needed, allowing for seamless interactions with the human expert. While I will discuss these methods primarily in the context of probabilistic and relational learning, I will also present our results on reinforcement learning and inverse reinforcement learning.

Bio: Dr. Sriraam Natarajan is an Associate Professor and the Director of the Center for ML in the Department of Computer Science at The University of Texas at Dallas. He was previously an Associate Professor and earlier an Assistant Professor at Indiana University and the Wake Forest School of Medicine, and a post-doctoral research associate at the University of Wisconsin-Madison; he graduated with his Ph.D. from Oregon State University. His research interests lie in the field of Artificial Intelligence, with emphasis on Machine Learning, Statistical Relational Learning and AI, Reinforcement Learning, Graphical Models, and Biomedical Applications. He has received the Young Investigator award from the US Army Research Office, an Amazon Faculty Research Award, an Intel Faculty Award, a XEROX Faculty Award, a Verisk Faculty Award, and the IU Trustees Teaching Award from Indiana University. He is the program co-chair of the SDM 2020 and ACM CoDS-COMAD 2020 conferences. He is the specialty chief editor of the Frontiers in ML and AI journal, an editorial board member of the MLJ, JAIR, and DAMI journals, and the electronic publishing editor of JAIR.
Keynote Talk 3
Data Intelligence in the Age of Accountability
Pieter De Leenheer
Senior Research Fellow, Harvard Business School
Co-Founder and Chief Science Officer, Collibra Inc.
pdeleenheer@hbs.edu

Abstract: Knowledge graphs, machine learning, and distributed ledgers are just a few of the emerging intelligent technologies that unlock new options to innovate business models, augment scientific knowledge and self-understanding, and enhance decision making. Because data is a critical driver for intelligent systems, machine calculation may supplant human decision making in many scenarios. The accessibility, quality, and currency of data are necessary criteria to ensure these systems produce viable innovation options that can be accounted for. But are these criteria sufficient?

Bio: Pieter is a senior research fellow at Harvard Business School and serves as adjunct faculty at Columbia University. He is a cofounder and former Chief Science Officer of Collibra, a unicorn venture in data intelligence, which spun off from his PhD research on community-based ontology management. Pieter writes, teaches, and advises on computing and management aspects of data innovation, accountability, and citizenship. He serves as an expert to the European Commission and several governments, and as a board member of several startups such as Gluetech.com and Yesse.tech. Prior to cofounding the company, Pieter was a professor at VU University of Amsterdam. He lives in New York City with his family.

Keynote Talk 4
Semantics in Biomedical Data Science
Jose Luis Ambite
Research Team Leader, Information Sciences Institute
Associate Research Professor, University of Southern California
ambite@isi.edu

Abstract: There is an explosion of biomedical data that promises to enable novel discoveries, treatments, and the ultimate goal of personalized medicine. These data are generated in a great variety of forms, ranging from sensor data, to imaging, to genetics, and all types of clinical data.
Moreover, the data are often scattered across organizations, and even the same data type is represented in diverse structures. Thus, the need to provide a semantically consistent view, so that the data can be meaningfully analyzed, is critical. I will describe core data integration and knowledge graph construction techniques, namely entity linkage and formal schema mappings, with illustrative biomedical data integration applications, highlighting some novel neural semantic similarity methods and some surprising applications of record linkage techniques, such as efficiently finding genetically related individuals. I will discuss architectures for large-scale data integration and analysis, including sensor data. Finally, I will discuss how we can analyze distributed datasets when the data cannot be shared for privacy or security reasons, and thus cannot be integrated. I will describe our recent work on Heterogeneous Federated Learning, which learns common neural models from siloed data.

Bio: Dr. Jose Luis Ambite is an Associate Research Professor in the Computer Science Department, and a Research Team Leader at the Information Sciences Institute, at the University of Southern California. His core expertise is in information integration, including query rewriting under constraints, learning schema mappings, and entity linkage. Dr. Ambite's research interests include databases, knowledge representation, the semantic web, semantic similarity, scientific workflows, and biomedical data science. He has published widely on these topics. He regularly serves as a reviewer for funding organizations, journals, and major conferences. In recent years, he has focused on developing novel approaches for the integration, analysis, and dissemination of biomedical and genetic data within several large NIH-funded projects, such as the PRISMS study, the NIMH Repository and Genetics Resource, SchizConnect, Population Architecture using Genomics and Epidemiology, and the Education Resource Discovery Index.
Textual Evidence for the Perfunctoriness of Independent Medical Reviews

Adrian Brasoveanu (abrsvn@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Megan Moodie (mmoodie@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Rakshit Agrawal (ragrawal@camio.com), Camio Inc., San Mateo, CA

ABSTRACT

We examine a database of 26,361 Independent Medical Reviews (IMRs) for privately insured patients, handled by the California Department of Managed Health Care (DMHC) through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance (either private insurance or the insurance that is part of their workers' comp; we focus on private insurance here). Laws requiring IMR were established in California and other states because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services. We analyze the text of the reviews and compare them closely with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10]. Despite the fact that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews, we can construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation, as well as low perplexity (11.86) and high categorical accuracy (0.53) on unseen test data, compared to the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy: 0.29 and 0.39). We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. We also examine four other corpora (drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]) to show that the IMR results are not typical for specialized-register corpora. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which points to the possibility that a crucial consumer protection mandated by law fails a sizeable class of highly vulnerable patients.

CCS CONCEPTS
• Computing methodologies → Latent Dirichlet allocation; Neural networks.

KEYWORDS
AI for social good, state-managed medical review processes, language models, topic models, sentiment classification

ACM Reference Format:
Adrian Brasoveanu, Megan Moodie, and Rakshit Agrawal. 2020. Textual Evidence for the Perfunctoriness of Independent Medical Reviews. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

1.1 Origin and structure of IMRs

Independent Medical Review (IMR) processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance – either private insurance or the insurance that is part of their workers' compensation. In this paper, we focus exclusively on privately insured patients. Laws requiring IMR processes were established in California and other states in the late 1990s because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services to maximize profit. [Footnote 1: For California, see the Friedman-Knowles Act of 1996, requiring California health plans to provide external independent medical review (IMR) for coverage denials. As of late 2002, 41 states and the District of Columbia had passed legislation creating an IMR process. In 34 of these states, including California, the decision resulting from the IMR is binding on the health plan. See [1, 15] for summaries of the political and legal history of the IMR system, and [2] for an early partial survey of the DMHC IMR data.]

As aptly summarized in [1], IMR is regularly used to settle disputes between patients and their health insurers over what is medically necessary or experimental/investigational care. Medical necessity disputes occur between health plans and patients because the health plan disagrees with the patient's doctor about the appropriate standard of care or course of treatment for a specific condition. Under the current system of managed care in the U.S., services rendered by a health care provider are reviewed to determine whether the services are medically necessary, a process referred to as utilization review (UR). UR is the oversight mechanism through which private insurers control costs by ensuring that only medically necessary care, covered under the contractual terms of a patient's insurance plan, is provided. Services that are not deemed medically necessary or fall outside a particular plan are not covered.

Procedures or treatment protocols are deemed experimental or investigational because the health plan – but not necessarily the patient's doctor, who in many cases has enough clinical confidence in a treatment to order it – considers them non-routine medical care, or takes them to be scientifically unproven to treat the specific condition, illness, or diagnosis for which their use is proposed.

It is important to realize that the IMR process is usually the third and final stage in the medical review process. The typical progression is as follows. After in-person and possibly repeated examination of the patient, the doctor recommends a treatment, which is then submitted for approval to the patient's health plan. If the treatment is denied in this first stage, both the doctor and the patient may file an appeal with the health plan, which triggers a second stage of reviews by the health-insurance provider, for which a patient can supply additional information and a doctor may engage in what is known as a "peer to peer" discussion with a health-insurance representative. If these second reviews uphold the initial denial, the only recourse the patient has is the state-regulated IMR process, and per California law, an IMR grievance form (and some additional information) is included with the denial letter.

An IMR review must be initiated by the patient and submitted to the California Department of Managed Health Care (DMHC), which manages IMRs for privately-insured patients. Motivated treating physicians may provide statements of support for inclusion in the documentation provided to DMHC by the patient, but in theory the IMR creates a new relationship of care between the reviewing physician(s) hired by a private contractor on behalf of DMHC, and the patient in question. The reviewing physicians' decision is supposed to be made based on what is in the best interest of the patient, not on cost concerns. It is this relation of care that constitutes the consumer protection for which IMR processes were legislated.

Understandably, given that the patients in question may be ill or disabled or simply discouraged by several layers of cumbersome bureaucratic processes, there is a very high attrition from the initial review to the final, IMR, stage. That is, only the few highly motivated and knowledgeable patients – or the extremely desperate – get as far as the IMR process.

The IMR process is regulated by the state, but it is actually conducted by a third party. At this time (2019), the provider in California and several other states across the US is MAXIMUS Federal Services, Inc. [Footnote 2: https://www.maximus.com/capability/appeals-imr] The costs associated with the IMR review, at least in California, are covered by health insurers. It is DMHC's and MAXIMUS's responsibility to collect all the documentation from the patient, the patient's doctor(s) and the health insurer. There are no independent checks that all the documentation has actually been collected, however, and patients do not see a final list of what has been provided to the reviewer prior to the IMR decision itself (a post facto list of file contents is mailed to patients along with the final, binding, decision; it is unclear what recourse a patient may have if they find pertinent information was missing from the review file).

Once the documentation is assembled, MAXIMUS forwards it to anywhere from one to three reviewers, who remain anonymous, but are certified by MAXIMUS to be appropriately credentialed and knowledgeable about the treatment(s) and condition(s) under review. The reviewer submits a summary of the case, and also a rationale and evidence in support of their decision, which is a binary Upheld/Overturned decision about the medical service. IMR reviewers do not enter a consultative relationship with the patient, doctor or health plan – they must render an uphold/overturn decision based solely on the provided medical records. However, as noted above, they are in an implied relationship of care to the patient, a point to which we return in the Discussion section below (§4).

While insurance carriers do not provide statistics about the percentage of requested treatments that are denied in the initial stage, looking at the process as a whole, a pattern of service denial aimed to maximize profit, rather than simply maintain cost effectiveness, seems to emerge. Typically, the argument for denial contends that the evidence for the beneficial effects of the treatment fails the prevailing standard of scientific evidence. This prevailing standard invoked by IMR reviewers is usually randomized control trials (RCTs), which are expensive, time-consuming trials that are run by large pharmaceutical companies only if the treatment is ultimately estimated to be profitable.

RCTs, however, have known limits: they "require minimal assumptions and can operate with little prior knowledge [which] is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded." [3] Inflexibly applying the RCT "gold standard" in the IMR process is often a way to ignore the doctors' knowledge and experience in a way that seems superficially well-reasoned and scientific. "RCTs can play a role in building scientific knowledge and useful predictions" – and we add, treatment recommendations – "only [. . . ] as part of a cumulative program, [in combination] with other methods." [3]

Notably, the experimental/investigational category of treatments that get denied often includes promising treatments that have not been fully tested in clinical RCTs – because the treatment is new or the condition is rare in the population, so treatment development costs might not ultimately be recovered. Another common category of experimental/investigational denials involves "off-label" drug uses, that is, uses of FDA-approved pharmaceuticals for a purpose other than the narrow one for which the drug was approved.

1.2 Main argument and predictions

Recall that these 'experimental' treatments or off-label uses are recommended by the patient's doctor, and therefore their potential benefits are taken to outweigh their possible negative effects. The recommending doctor is likely very familiar with the often lengthy, tortuous and highly specific medical history of the patient, and with the list of 'less experimental' treatments that have been proven unsuccessful or have been removed from consideration for patient-specific reasons. It is also important to remember that many rare conditions have no "on-label" treatment options available, since expensive RCTs and treatment approval processes are not undertaken if companies do not expect to recover their costs, which is likely if the potential 'market' is small (few people have the rare condition).

Therefore, our main line of argumentation is as follows.

• Since IMRs are the final stage in a long bureaucratic process in which health insurance companies keep denying coverage for a treatment repeatedly recommended by a doctor as medically necessary, we expect that the issue of medical necessity is non-trivial when that specific patient and that specific treatment are carefully considered.
• We should therefore expect the text of the IMRs, which justifies the final determination, to be highly individualized and argue for that final decision (whether congruent with the health plan's decision or not) in a way that involves the particulars of the treatment and the particulars of the patient's medical history and conditions.

Thus, we expect a reasoned, thoughtful IMR to not be highly generic and templatic / predictable in nature. For instance, legal documents may be highly templatic as they discuss the application of the same law or policy across many different cases, but a response carefully considering the specifics of a medical case reaching the IMR stage is not likely to be similar to many other cases. We only expect high similarity and 'templaticity' for IMR reviews if they are reduced to a more or less automatic application of some prespecified set of rules (rubber-stamping).

1.3 Main results, and their limits

Concomitantly with this quantitative study, we conducted preliminary qualitative research with a focus on pain management and chronic conditions. We investigated the history of the IMR process, in addition to having direct experience with it. We had detailed conversations with doctors in Northern California and on private social media groups formed around chronic conditions and pain management. This preliminary research reliably points towards the possibility that IMR reviews are perfunctory, and that this crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. In this paper, we focus on the text of the IMR decisions and attempt to quantify the evidence for the perfunctoriness of the IMR process that they provide.

The text of the IMR findings does not provide unambiguous evidence about the quality and appropriateness of the IMR process. If we had access to the full, anonymized patient files submitted to the IMR reviewers (in addition to the final IMR decision and the associated text), we might have been able to provide much stronger evidence that IMRs should have a significantly higher percentage of overturns, and that the IMR process should be improved in various ways, e.g., (i) patients should be able to check that all the relevant documentation has been collected and will be reviewed, and (ii) the anonymous reviewers should be held to higher standards of doctor-patient care. At the very least, one would want to compare the reports/letters produced by the patient's doctor(s) and the IMR texts. However, such information is not available and there are no visible signs suggesting potential availability in the near future. The information that is made available by DMHC constitutes the IMR decision – whether to uphold or overturn the health plan decision – the anonymized decision letter, and information about the requested treatment category (also available in the letter). We, therefore, had to limit ourselves to the text of the DMHC-provided IMR findings in our empirical analysis.

A qualitative inspection of the corpus of IMR decisions made available by the California DMHC site as of June 2019 (a total of 26,631 cases spanning the years 2001-2019) indicates that the reviews – as documented in the text of the findings – focus more on the review procedure and associated legalese than on the actual medical history of the patient and the details of the case.

The goal in this paper is to investigate to what extent Natural Language Processing (NLP) / Machine Learning (ML) methods that are able to extract insights from large corpora point in the same direction, thus mitigating cherry-picking biases that are sometimes associated with qualitative investigations. In addition to the IMR text, we perform a comparative study with additional English-language datasets in an attempt to eliminate data-specific and problem-specific biases.

• We analyze the text of the IMR reviews and compare them with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10].
• As the size of data has significant consequences for language-model training, and NLP/ML models more generally, we expect models trained on the Yelp and IMDB corpora to outperform models trained on the IMR corpus, given that the IMDB corpus is twice as large as the IMR corpus, and the Yelp samples contain almost twice as many reviews.
• In this paper, we instead demonstrate that we were able to construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation.
• In addition, the model achieves a much lower perplexity (11.86) and a higher categorical accuracy (0.53) on unseen test data, compared to models trained on the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; categorical accuracy: 0.29 and 0.39).
• We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews. In an attempt to mitigate confirmation bias, as well as potentially significant register differences between IMRs and movie or restaurant reviews, we examine four additional corpora: drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]. These specialized-register corpora are potentially more similar to IMRs than IMDB or Yelp: the texts are more likely to be highly similar, include boilerplate text and have a templatic/standardized structure. We find that predictability of IMR texts, as measured by language-model perplexity and categorical accuracy, is higher than all the comparison datasets by a good margin.

Based on these empirical comparisons, we conclude that we have strong evidence that the IMR reviews are perfunctory and, therefore, that a crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. The paper is structured as follows. In Section 2, we discuss the datasets
For in detail, with a focus on the nature and characteristics of the IMR example, decisions for chronic pain management seem to mostly data. In Section 3, we discuss the models we use to analyze the IMR, rubber-stamp the Medical Treatment Utilization Schedule (MTUS) Yelp and IMDB datasets, as well as the four auxiliary corpora (drug guidelines, with very little consideration of the rarity of the un- reviews, data science jobs, legals cases and recipes). The section also derlying condition(s) (see our comments about RCTs above), or compares and discusses the results of these models. Section 4 puts a thoughtful evaluation of the risk/benefit profile of the denied all the results together into an argument for the perfunctoriness of treatment relative to the specific medical history of the patient the IMRs. Section 5 concludes the paper and outlines directions for (assuming this history was adequately documented to begin with). future work. KiML’20, August 24, 2020, San Diego, California, USA, Brasoveanu, Moodie and Agrawal 2 THE DATASETS Table 2: Outcome counts and percentages by year 2.1 The IMR dataset ReportYear Total # of cases Overturned Upheld The IMR dataset was obtained from the DMHC website in June 2001 28 7 (25%) 21 20193 and was minimally preprocessed. It contains 26,361 cases / 2002 695 243 (35%) 452 observations and 14 variables, 4 of which are the most relevant: 2003 738 280 (38%) 458 2004 788 305 (39%) 483 • TreatmentCategory: the main treatment category; 2005 959 313 (33%) 646 • ReportYear: year the case was reported; 2006 1080 442 (41%) 638 • Determination: indicates if the determination was upheld or 2007 1342 571 (43%) 771 overturned; 2008 1521 678 (45%) 843 • Findings: a summary of the case findings. 2009 1432 641 (45%) 791 The top 14 treatment categories (with percentages of total ≥ 2%), 2010 1453 661 (45%) 792 together with their raw counts and percentages are provided in 2011 1435 684 (48%) 751 2012 1203 589 (49%) 614 Table 1. 
2013 1197 487 (41%) 710 2014 1433 549 (38%) 884 Table 1: Top 14 treatment categories 2015 2079 1070 (51%) 1009 2016 3055 1714 (56%) 1341 TreatmentCategory Case count % of total 2017 2953 1391 (47%) 1562 Pharmacy 6480 25% 2018 2545 1218 (48%) 1327 Diag Imag & Screen 4187 16% 2019 425 209 (49%) 216 Mental Health 2599 10% DME 1714 7% Gen Surg Proc 1227 5% Orthopedic Proc 1173 5% Rehab/ Svc - Outpt 1157 4% Cancer Care 1029 4% Elect/Therm/Radfreq 828 3% Reconstr/Plast Proc 825 3% Autism Related Tx 767 3% Emergency/Urg Care 582 2% Diag/ MD Eval 573 2% Pain Management 527 2% Figure 1: % Overturned claimed on DMHC site (June 2019) The breakdown of cases by patient gender (not recorded for all 2.2 The comparison datasets cases) is as follows: Female – 14823 (56%), Male – 10836 (41%), Other As comparison datasets, we use the IMDB movie-review dataset [10], – 11 (0.0004%). which has 50,000 reviews and a binary positive/negative sentiment The breakdown by determination (the outcome of the IMR) is: classification associated with each review. This dataset will be par- Upheld – 14309 (54%), Overturned – 12052 (46%). ticularly useful as a baseline for our ULMFiT transfer-learning The outcome counts and percentages by year are provided in language models (and subsequent transfer-learning classification Table 2. The number of cases for 2019 include only the first 5 months models), where we show that we obtain results for the IMDB dataset of the year plus a subset of June 2019. that are similar to the ones in the original ULMFiT paper [8]. Interestingly, the DMHC website featured a graphic in June 2019 There are 50,000 movie reviews in the IMDB dataset, evenly split (Figure 1) that reports the percentage of Overturned outcomes to be into negative and positive reviews. The histogram of text lengths 64%, a figure that does not accord with any of our data summaries. for IMDB reviews is provided in Figure 2. 
The reviews contain a We intend to follow up on this issue and see if the DMHC can share total of 11,557,297 words. The mean length of a review is 231.15 their data-analysis pipeline so that we can pinpoint the source(s) words, with an SD of 171.32. of this difference. We select a sample of 50,000 Yelp (mainly restaurant) reviews [19], Given that our main goal here is to investigate the text of the with associated binarized negative/positive evaluations, to provide IMR findings and its predictiveness with respect to IMR outcomes, a comparison corpus intermediate between our DMHC dataset and we provide some general properties of this corpus. The histogram the IMDB dataset. From a total of 560,000 reviews (evenly split be- of word counts for the IMR findings (the text associated with each tween negative and positive), we draw a weighted random sample case) is provided in Figure 2. There are 26,361 texts, with a total of with the weights provided by the histogram of text lengths for the 5,584,280 words. Words are identified by splitting texts on white IMR corpus. The resulting sample contains 25,809 (52%) negative space (sufficient for our purposes here). The mean length of a text reviews and 24,191 (48%) positive reviews. The histogram of text is 211.84 words, with a standard deviation (SD) of 120.58. lengths for Yelp reviews is also provided in Figure 2. The reviews 3 https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinations- contain a total of 7,038,467 words. The mean length of a review is trend. 140.77 words, with an SD of 71.09. 
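The length-weighted Yelp sampling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name, the bin width, and the use of the Efraimidis-Spirakis key trick for weighted sampling without replacement are all our assumptions.

```python
import random
from collections import Counter

def length_weighted_sample(pool_texts, target_texts, k, bin_width=50, seed=0):
    """Sample k texts from pool_texts so that their length distribution
    approximates that of target_texts (cf. the IMR-weighted Yelp sample).
    Weights come from the target corpus's binned length histogram; the
    Efraimidis-Spirakis key u**(1/w) gives weighted sampling w/o replacement."""
    rng = random.Random(seed)
    nbin = lambda t: len(t.split()) // bin_width  # whitespace word count, binned
    hist = Counter(nbin(t) for t in target_texts)
    total = sum(hist.values())
    keyed = []
    for t in pool_texts:
        w = hist.get(nbin(t), 0) / total  # density of this length bin in target
        if w > 0:                         # zero-weight texts can never be drawn
            keyed.append((rng.random() ** (1.0 / w), t))
    keyed.sort(reverse=True)              # largest keys win
    return [t for _, t in keyed[:k]]
```

With a target corpus of short texts, candidates whose length bin never occurs in the target are excluded outright, so the sample's length profile tracks the target's histogram.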
Figure 2: Histograms of text lengths (numbers of words per text) for the IMR, IMDB and Yelp corpora. (a) IMR; (b) IMDB; (c) Yelp.

Figure 3: Histograms of text lengths (numbers of words per text) for the auxiliary datasets. (a) Drug Reviews; (b) DS Jobs; (c) Legal cases; (d) Recipes.

2.3 Four auxiliary datasets

We also analyze four other specialized-register corpora: drug reviews [6], data science (DS) job postings [9], legal case reports [5] and cooking recipes [11]. The modeling results for these specialized-register corpora will enable us to better contextualize and evaluate the modeling results for the IMR, IMDB and Yelp corpora, since these four auxiliary datasets might be seen as more similar to the IMR corpus than movie or restaurant reviews. The drug-review corpus contains reviews of pharmaceutical products, which are closer in subject matter to IMRs than movie/restaurant reviews. The other three corpora are all highly specialized in register, just like the IMRs, with two of them (DS jobs and legal cases) particularly similar to the IMRs in that they involve templatic texts containing information aimed at a specific professional sub-community.

These four corpora are very different from each other, and from the IMR corpus, in terms of (i) the number of texts they contain and (ii) the average text length (number of words per text). Because of this, there was no obvious way to sample from them, and from the IMR, IMDB and Yelp corpora, such that the resulting samples were both roughly comparable with respect to the total number of texts and average text length, and also large enough to obtain reliable model estimates. We therefore analyzed these four corpora as a whole.

The drug-review corpus includes 132,300 drug reviews – more than double the number of texts in the IMDB and Yelp datasets, and more than 4 times the number of texts in the IMR dataset. From the original corpus of 215,063 reviews, we only retained the reviews associated with a rating of 10, which we label as positive, and those with a rating of 1 through 5, which we label as negative.[4] The histogram of text lengths for drug reviews is provided in Figure 3. The reviews contain a total of 11,015,248 words, with a mean length of 83.26 words per review (significantly shorter than the IMR/IMDB/Yelp texts) and an SD of 45.73.

The DS corpus includes 6,953 job postings (about a quarter of the number of texts in the IMR corpus), with a total of 3,731,051 words. The histogram of text lengths is provided in Figure 3. The mean length of a job posting is 536.61 words (more than twice as long as the IMR/IMDB/Yelp texts), with an SD of 254.06.

There are 3,890 legal-case reports (even fewer than DS job postings), with a total of 25,954,650 words (about 5 times larger than the IMR corpus). The histogram of text lengths for the legal-case reports is provided in Figure 3. The mean length of a report is 6,672.15 words (an order of magnitude longer than IMR/IMDB/Yelp), with a very high SD of 11,997.98.

Finally, the recipe corpus includes more than 1 million texts: there are 1,029,719 recipes, with a total of 117,563,275 words (very large compared to our other corpora). The histogram of text lengths for the recipes is provided in Figure 3. The mean length of a recipe is 114.17 words (close to the length of a drug review, and roughly half of an IMR), with an SD of 90.54.

[4] We did this so that we have a fairly balanced dataset (68,005 positive drug reviews and 64,295 negative reviews) with which to estimate classification models like the ones we report for the IMR, IMDB and Yelp corpora in the next section. For completeness, the drug-review classification results on previously unseen test data are as follows: logistic regression accuracy: 77.89%; accuracy of a multilayer perceptron with a 1,000-unit hidden layer and a ReLU non-linearity: 83.18%; ULMFiT classification model accuracy: 96.12%.

3 THE MODELS

In this section, we analyze the text of the IMR findings and its predictiveness with respect to IMR outcomes, and we systematically compare these results with the corresponding ones for the IMDB and Yelp corpora. The datasets were split into training (80%), validation (10%) and test (10%) sets. Test sets were only used for the final model evaluation.
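The 80/10/10 split described above can be sketched as follows; this is a generic illustration under our own naming, not the authors' actual data pipeline.

```python
import random

def train_valid_test_split(items, seed=42, train=0.8, valid=0.1):
    """Shuffle once with a fixed seed, then cut into 80/10/10
    train/validation/test slices; the test slice is set aside
    and used only for the final model evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```

Fixing the seed makes the split reproducible across model families, so the baseline classifiers, topic models, and language models below can all be evaluated on the same held-out texts.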
We start with baseline classification models (logistic regressions and logistic multilayer perceptrons with one hidden layer) to establish that the reviews in all three datasets under consideration are highly predictive of the associated binary outcomes. Once the predictiveness, hence relevance, of the text is established, we turn to an in-depth analysis of the texts themselves by means of topic and language models. We see that the text of the IMR reviews is significantly different (more predictable, less diverse / contentful) when compared to movie and restaurant reviews. We then turn to a final set of classification models that leverage transfer learning from the language models to see how predictive the texts can really be with respect to the associated binary outcomes. Finally, we report the results of estimating language models for the 4 auxiliary datasets introduced in the previous section.

The main conclusion of this extensive series of models is that the IMR corpus is an outlier, and it would be easy to make the IMR process fully automatic: it is pretty straightforward to train models that generate high-quality, realistic IMR reviews and generate binary decisions that are very reliably associated with these reviews. In contrast, movie and restaurant reviews produced by unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more human-like depth, sophistication and attention to detail, so current NLP models do not perform as well on them.

3.1 Classification models

We regress outcomes (Upheld/Overturned for IMR, negative/positive sentiment for IMDB/Yelp) against the text of the corresponding findings / reviews. For the purposes of these basic classification models, as well as the topic models discussed in the following subsection, the texts were preprocessed as follows. First, we removed stop words; for the IMR dataset, we also removed the following high-frequency words: patient, treatment, reviewer, request, medical and medically; for the IMDB dataset, we also removed the words film and movie. After part-of-speech tagging, we retained only nouns, adjectives, verbs and adverbs, since lexical meanings provide the most useful information for logistic (more generally, feed-forward) models and for topic models. The resulting dictionary for the IMR dataset had 23,188 unique words. We ensured that the dictionaries for the IMDB and Yelp datasets were also between 23,000 and 24,000 words by eliminating infrequent words. Bounding the dictionaries for each dataset to a similar range helps mitigate dataset-specific modeling biases: differently-sized vocabularies lead to differently-sized parameter spaces for the models. We extracted features by converting each text into sparse bag-of-words vectors of dictionary length, which record how many times each token occurred in the text.

These feature representations were the input to all the classifier models we consider in this subsection. The multilayer perceptron model had a single hidden layer with 1,000 units and a ReLU non-linearity. The classification accuracies on the test data for all three datasets are provided in Table 3.

Table 3: Classification accuracy for basic models

Model | IMR | IMDB | Yelp
logistic regression | 90.75% | 86.30% | 87.62%
multilayer perceptron | 90.94% | 87.14% | 88.92%

We see that the text of the findings / reviews is highly predictive of the associated binary outcomes, with the highest accuracy for the IMR dataset, despite the fact that it contains half the observations of the other two datasets. We can therefore turn to a more in-depth analysis of the texts to understand what kind of textual justification is used to motivate the IMR binary decisions. To that end, we examine and compare the results of two unsupervised/self-supervised types of models: topic models and language models.

3.2 Topic models

Topic modeling [17] is an unsupervised method that distills semantic properties of words and documents in a corpus in terms of probabilistic topics. The most widespread measure for topic-model evaluation is the coherence score [14]. Typically, as we increase the number of topics from very few, say 4, to more of them, we see an increase in coherence score that tends to level out after a certain number of topics. When modeling the IMDB and Yelp datasets, we see exactly this behavior, as shown in Figure 4.

Figure 4: Coherence scores for topic models (x-axis: number of topics; y-axis: coherence score). (a) IMR; (b) IMDB; (c) Yelp.

In contrast, the 4-topic model has the highest coherence score (0.56) for the IMR dataset, also shown in Figure 4, and the coherence score drops as we add more topics. As the word clouds for the 4-topic model in Figure 5 show, these 4 topics mostly reflect the legalese associated with the IMR review procedure and very little, if anything, of the treatments and conditions that were the main point of the review. In contrast, the corresponding high-scoring topic models for the IMDB and Yelp datasets reflect actual features of movies (e.g., family-life movies, westerns, musicals) or of venues (breakfast/lunch places, restaurants, shops, bars, hotels etc.).

Recall that IMRs are the legally-mandated last resort for patients seeking treatments (usually) ordered by their doctors, which their health plan refuses to cover. The reviews are conducted exclusively based on documentation. Putting aside the fact that it is unclear how much effort is taken to ensure that the documentation is complete, especially for patients with extensive and complicated health records, we see that relatively little specific information about a patient's medical history, condition(s), or the recommended treatments is reflected in the text of these decisions. The text seems to consist largely of legalese about the IMR process, the health plan / providers, basic demographic information about the patient, and generalities about the medical service or therapy requested for the enrollee's condition.

3.3 Language models with transfer learning

Language models, specifically those using neural networks, are usually recurrent-network or transformer-based architectures designed to learn textual distributional patterns in an unsupervised or self-supervised manner. Recurrent-network models – on which we focus here – commonly use Long Short-Term Memory (LSTM) [7] "cells," which are able to learn long-term dependencies in sequences. Representing text as a sequence of words, language models build rich representations of the words, sentences, and their relations within a certain language. We estimate a language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8]. Just as in [8], we use the AWD-LSTM model [12], a vanilla LSTM with 4 kinds of dropout regularization, an embedding size of 400, 3 LSTM layers (1,150 units per layer), and a BPTT of size 70.
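The fixed-length BPTT windows just mentioned can be illustrated with a minimal sketch. This is our own illustration of truncated backpropagation through time batching, not the fastai/AWD-LSTM implementation; the function name and stride handling are assumptions.

```python
def bptt_windows(token_ids, bptt=70):
    """Cut a token-id stream into (input, target) windows of length `bptt`
    for truncated backpropagation through time: the target sequence is the
    input sequence shifted by one position (next-word prediction)."""
    windows = []
    # each window needs one extra trailing token to form the shifted target
    for start in range(0, len(token_ids) - 1, bptt):
        chunk = token_ids[start:start + bptt + 1]
        if len(chunk) < 2:  # not enough tokens left for an (input, target) pair
            break
        windows.append((chunk[:-1], chunk[1:]))
    return windows
```

For example, with `bptt=3`, the stream `[0, 1, 2, 3, ...]` yields the pair `([0, 1, 2], [1, 2, 3])` first: every position's training target is simply the next token.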
Figure 5: Word clouds for the 4-topic IMR model

The AWD-LSTM model is pretrained on Wikitext-103 [13], consisting of 28,595 preprocessed Wikipedia articles with a total of 103 million words. This pretrained model is fairly simple (no attention, skip connections etc.), and the pretraining corpus is of modest size. To obtain our final language models for the IMR, IMDB and Yelp corpora, we fine-tune the pretrained AWD-LSTM model using discriminative [18] and slanted triangular [8, 16] learning rates. We do the same kind of minimal text preprocessing as in [8].

The IMR language model can generate high-quality and largely coherent text, unlike the IMDB / Yelp models. Two samples of generated text are provided below (each sample begins with its 'seed' text, boldfaced in the original).

• The issue in this case is whether the requested partial hospitalization program ( PHP ) services are medically necessary for treatment of the patient 's behavioral health condition . The American Psychiatric Association ( APA ) treatment guidelines for patients with eating disorders also consider PHP acute care to be the most appropriate setting for treatment , and suggest that patients should be treated in the least restrictive setting which is likely to be safe and effective . The PHP was initially recommended for patients who were based on their own medical needs , but who were
• The patient was admitted to a skilled nursing facility ( SNF ) on 12 / 10 / 04 . The submitted documentation states the patient was discharged from the hospital on 12 / 22 / 04 . The following day the patient 's vital signs were stable . The patient had been ambulating to the community with assistance with transfers , but has not had any recent medical or rehabilitation therapy . The patient had no new medical problems and was discharged in stable condition . The patient has requested reimbursement for the inpatient acute rehabilitation services provided

We see that the IMR language model is highly performant, despite the simple model architecture we used, the modest size of the pretraining corpus, and the small size of the IMR corpus. The quality of the generated text is also very high, particularly given all these limitations.
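The slanted triangular learning-rate schedule used in the fine-tuning above can be sketched directly from the formula in the ULMFiT paper [8]: a short linear warm-up over the first `cut_frac` of the training iterations, then a long linear decay. The code below is our own sketch of that published formula (with the paper's default hyperparameters), not the fastai implementation.

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t of T (ULMFiT [8]):
    linear increase to lr_max over the first cut_frac * T iterations,
    then linear decay back down to lr_max / ratio."""
    cut = int(T * cut_frac)           # iteration at which the peak is reached
    if t < cut:
        p = t / cut                   # warm-up phase: fraction of the way up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

With `T=100`, the schedule starts at `lr_max / ratio`, peaks at `lr_max` at iteration 10, and decays back to `lr_max / ratio` by iteration 100 — the asymmetric triangle that gives the schedule its name.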
The perplexity and categorical accuracy for the 3 language models are provided in Table 4. The perplexity for the IMR findings is much lower than for the IMDB / Yelp reviews, and the language model can correctly guess the next word more than half the time.

Table 4: Language-model perplexity and categorical accuracy

Metric | IMR | IMDB | Yelp
perplexity | 11.86 | 36.96 | 40.3
categorical accuracy | 53% | 39% | 29%

3.4 Classification with transfer learning

We further fine-tune the language models discussed in the previous subsection to train classifiers for the three datasets. Following [4, 8], we gradually unfreeze the classifier models to avoid catastrophic forgetting. The results of evaluating the classifiers on the withheld test sets are provided in Table 5. Despite the fact that the IMR dataset contains half of the classification observations of the other two datasets, we obtain the highest level of accuracy when predicting binary Upheld/Overturned decisions based on the text of the IMR findings.

Table 5: Accuracy for transfer-learning classifiers

Metric | IMR | IMDB | Yelp
classification accuracy | 97.12% | 94.18% | 96.16%
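The two language-model metrics reported in Table 4 have simple definitions: perplexity is the exponential of the mean negative log-likelihood of the true next words, and categorical accuracy is the fraction of positions where the model's top-scoring word is the true next word. The sketch below illustrates both under our own naming; it is an evaluation-metric illustration, not the authors' code.

```python
import math

def perplexity_and_accuracy(predictions, targets):
    """predictions: one probability distribution over the vocabulary per
    position; targets: the true next-word ids.  Returns
    (exp(mean negative log-likelihood), fraction of correct argmax guesses)."""
    nll, hits = 0.0, 0
    for dist, y in zip(predictions, targets):
        nll += -math.log(dist[y])                                  # NLL of the true word
        hits += int(max(range(len(dist)), key=dist.__getitem__) == y)  # top-1 hit?
    n = len(targets)
    return math.exp(nll / n), hits / n
```

A model that assigns uniform probability over a vocabulary of size V has perplexity exactly V, which is why perplexity is often read as an "effective branching factor": the IMR model's 11.86 means it is, on average, choosing among roughly 12 equally likely next words, versus about 40 for Yelp.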
Table 6: Comparison of language models across all datasets (best-performing metrics boldfaced in the original)

Dataset | Perplexity | Categorical Accuracy
IMR reviews | 11.86 | 0.53
Legal cases | 18.17 | 0.43
DS Jobs | 22.14 | 0.41
Drug reviews | 25.06 | 0.36
Recipes | 29.56 | 0.39
IMDB | 36.96 | 0.39
Yelp | 40.3 | 0.29

Figure 6: Comparison of language-model perplexity and categorical accuracy across all the datasets

3.5 Models for auxiliary corpora

We also estimated topic and language models for the 4 auxiliary corpora (drug reviews, DS jobs, legal cases and cooking recipes). The associations between coherence scores and number of topics for these 4 corpora were similar to the ones plotted in Figure 4 above for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best topic models had at least 14 topics, often more, with coherence scores above 0.5. The quality of the topics was also high, with intuitively coherent and contentful topics (just like IMDB / Yelp).

The perplexity and accuracy of the ULMFiT language models on previously-withheld test data are provided in Table 6, which contains the results for all 7 datasets under consideration in this paper. We see that the predictability of the IMR corpus, as reflected in its perplexity and categorical accuracy scores, is still clearly higher than that of the 4 auxiliary corpora. The perplexity of the legal-case corpus (18.17) is somewhat close to the IMR perplexity (11.86), but we should remember that the legal-case corpus is about 5 times larger than the IMR corpus. Furthermore, the legal-case categorical accuracy of 43% is still substantially lower than the IMR accuracy of 53%. Notably, even the recipe corpus, which is about 20 times larger than the IMR corpus (≈ 117.5 vs. ≈ 5.5 million words), does not have test-set scores similar to the IMR scores. The results for these 4 auxiliary corpora indicate that the IMR corpus is an outlier, with very highly templatic and generic texts.

4 DISCUSSION

The models discussed in the previous section show that language-model learning is significantly easier for IMRs than for the other 6 corpora. As can be seen in Table 6, the perplexity of the language model for IMR reviews is clearly lower than even that for legal cases, for which we expect highly templatic language and high similarity between texts. This pattern can be clearly observed in Figure 6, with the IMR corpus clearly at the very end of the high-to-low predictability spectrum.

One would not expect such highly predictable texts in an ideal scenario, where each medical review is thorough, and each decision is accompanied by strong medical reasoning relying on the specifics of the case at hand, and based on an objective physician's, or team of physicians', opinion as to what is in the patient's best interest. Arguably, these medically complex cases are as diverse as Hollywood blockbusters or fashionable restaurants – the patients themselves certainly experience them as unique and meaningful – and their reviews should be similarly diverse, or at most as templatic as a job posting or a cooking recipe. We wouldn't expect these medical reviews to be so much more predictable and generic than less socially consequential reviews of movies and restaurants.

What are the ethical and potentially legal consequences of these findings? First, while state legislators assume we have strong health-insurance related consumer protections in place – an image DMHC goes to great lengths to promote – we find the reviews to be upholding insurance-plan denials at rates that exceed what one might expect, given that the treatments in question are frequently ordered by a treating physician, and that the IMR process is the last stage in a bureaucratically laborious (hence high-attrition) process of appealing health-plan denials.

Second, given that the IMR process creates an implied relation of care between the reviewers hired by MAXIMUS and the patient – since reviewers are, after all, being entrusted with the best interests of the patient without regard to cost – one can hardly say that they are fulfilling their obligations as doctors to their patient with such seemingly rote, perfunctory reviews.

Third, if IMR processes were designed to make sure that (i) treatment decisions are made by doctors, not by profit-driven businesses, and (ii) insurance companies cannot welch on their responsibilities to plan members, one must wonder whether prescribing physicians are wrong more than half the time. Do American doctors really order so many erroneous, medically unnecessary treatments and medications? If so, how is it possible that they are so committed to and confident in them that they are willing to escalate the appeal process all the way to the state-managed IMR stage? Or is it that IMRs often serve as a final rubber stamp for health-insurance plan denials, failing their stated mission of protecting a vulnerable population?

We end this discussion section by briefly reflecting on the way we used ML/NLP methods for social-good problems in this paper. Overwhelmingly, the social-good applications of these methods and models seem to be predictive in nature: their goal is to improve the outcomes of a decision-making process, and the improvement is evaluated according to various performance-related metrics. An important class of metrics currently being developed has to do with ethical, or 'safe,' uses of ML/AI models. In contrast, our use of ML models in this paper was analytical, with the goal of extracting insights from large datasets that enable
us to empirically evaluate how well an established decision-making process with high social impact functions. Data analysis of this kind, more akin to hypothesis testing than to predictive modeling, is in fact one of the original uses of statistical models / methods. Unfortunately, using ML models in this way does not straightforwardly lead to plots showing how ML models obviously improve metrics like the efficiency or cost of a process. We think, however, that there are as many socially beneficial opportunities for this kind of data-analysis use of ML modeling as there are for its predictive uses. The main difference between them seems to be that the data-analysis uses do not lead to more-or-less immediately measurable products. Instead, they are meant to become part of a larger argument and evaluation of a socially and politically relevant issue, e.g., the ethical status of the current health-insurance related practices and consumer protections discussed here. What counts as 'success' when ML models are deployed in this way is less immediate, but could provide at least as much social good in the long run.

5 CONCLUSION AND FUTURE WORK

We examined a database of 26,361 IMRs handled by the California DMHC through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance. We found that, in a majority of cases, IMRs uphold the health-insurance denial, despite DMHC's claim to the contrary. In addition, we analyzed the text of the reviews and compared it with a sample of 50,000 Yelp reviews and the IMDB movie-review corpus. Despite the fact that these corpora are basically twice as large, we can construct a very good language model for the IMR corpus, as measured by the quality of text generation, as well as by its low perplexity and high categorical accuracy on unseen test data. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail than IMR reviews, which seem highly templatic and perfunctory in comparison. We see similar trends in topic models and in classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. These results were further confirmed by topic and language models for four other specialized-register corpora (drug reviews, data science job postings, legal-case reports and cooking recipes).

Directions for future work include, but are not limited to, (i) adding ways for patients to check that all the relevant documentation has been collected and will be reviewed, and (ii) identifying ways to hold the anonymous reviewers to higher standards of doctor-patient care. We are in the process of extending our datasets with (i) workers'

ACKNOWLEDGMENTS

We are grateful to four KDD-KiML anonymous reviewers for their comments on an earlier version of this paper. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of two Titan V GPUs used for this research, as well as the UCSC Office of Research and The Humanities Institute for a matching grant to purchase additional hardware. The usual disclaimers apply.

REFERENCES

[1] Leatrice Berman-Sandler. 2004. Independent Medical Review: Expanding Legal Remedies to Achieve Managed Care Accountability. Annals of Health Law 13 (2004).
[2] Kenneth H. Chuang, Wade M. Aubry, and R. Adams Dudley. 2004. Independent Medical Review of Health Plan Coverage Denials: Early Trends. Health Affairs 23, 6 (2004), 163–169. https://doi.org/10.1377/hlthaff.23.6.163
[3] Angus Deaton and Nancy Cartwright. 2018. Understanding and misunderstanding randomized controlled trials. Social Science and Medicine 210 (2018), 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005
[4] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1615–1625. https://doi.org/10.18653/v1/D17-1169
[5] Filippo Galgani and Achim Hoffmann. 2011. LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence, Jiuyong Li (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 445–454.
[6] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In DH '18. Association for Computing Machinery, New York, NY, USA, 121–125. https://doi.org/10.1145/3194658.3194677
[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[8] Jeremy Howard and Sebastian Ruder. 2018. Fine-tuned Language Models for Text Classification. CoRR abs/1801.06146 (2018). http://arxiv.org/abs/1801.06146
[9] Shanshan Lu. 2018. Data Scientist Job Market in the U.S. https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us More info available here: https://github.com/Silvialss/projects/tree/master/IndeedWebScraping
[10] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In HLT '11. Association for Computational Linguistics, Stroudsburg, PA, USA, 142–150.
[11] Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. (2019).
comp cases from California and (ii) private insurance cases from [12] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing other states. This will enable us to investigate if the reviews for and Optimizing LSTM Language Models. CoRR abs/1708.02182 (2017). [13] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. workers’ comp cases are substantially different from the DMHC Pointer Sentinel Mixture Models. CoRR abs/1609.07843 (2017). IMR data (the percentage of upheld decisions is much higher for [14] Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures (WSDM ’15). ACM, New York, NY, USA, workers’ comp: ≈ 90%), as well as if the reviews vary substantially 399–408. https://doi.org/10.1145/2684822.2685324 across states. [15] Shirley Eiko Sanematsu. 2001. Taking a broader view of treatment disputes Another direction for future work is to follow up on our pre- beyond managed care: Are recent legislative efforts the cure? UCLA Law Review 48 (2001). liminary qualitative research with a survey of patients that have [16] Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In experienced the IMR process to see if these patients agree with the Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE. DMHC-promoted message that the IMR process provides strong 464–472. [17] Mark Steyvers and Tom Griffiths. 2007. Probabilistic Topic Models. Lawrence consumer protection against unjustified health-plan denials. This Erlbaum Associates. could also enable us to verify if the medical documentation col- [18] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transfer- able are features in deep neural networks?. In Advances in Neural Information lected during the IMR process is complete and actually taken into Processing Systems. 3320–3328. account when the decision is made. [19] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 
2015. Character-level Con- The ultimate upshot of this project would be a list of recommen- volutional Networks for Text Classification. CoRR abs/1509.01626 (2015). arXiv:1509.01626 http://arxiv.org/abs/1509.01626 dations for the improvement of the IMR process, including but not Knowledge Intensive Learning of Generative Adversarial Networks Devendra Singh Dhami Mayukh Das Sriraam Natarajan devendra.dhami@utdallas.edu Samsung Research India The University of Texas at Dallas The University of Texas at Dallas mayukh.das@samsung.com sriraam.natarajan@utdallas.edu ABSTRACT We aim to address the above limitations. Inspired by Mitchell’s While Generative Adversarial Networks (GANs) have accelerated argument of “The Need for Biases in Learning Generalizations” [38], the use of generative modelling within the machine learning com- we mitigate the challenges of existing data hungry methods via in- munity, most of the applications of GANs are restricted to images. ductive bias while learning GANs. We show that effective inductive The use of GANs to generate clinical data has been rare due to the bias can be provided by humans in the form of domain knowl- inability of GANs to faithfully capture the intrinsic relationships edge [14, 27, 41, 50]. Rich human advice can effectively balance between features. We hypothesize and verify that this challenge can the impact of quality (sparsity) of training data. Data quality also be mitigated by incorporating domain knowledge in the generative contributes to, the well studied, modal instability of GANs. This process. Specifically, we propose human-allied GANs that using problem is especially critical in domains such as medical/clinical correlation advice from humans to create synthetic clinical data. Our analytics that does not typically exhibit ‘spatial homophily’ [21], un- empirical evaluation demonstrates the superiority of our approach like images, and are prone to distributional diversity among feature over other GAN models. 
clusters as well. Our human-guided framework proposes a robust strategy to address this challenge. Note that in our setting the human CCS CONCEPTS is an ally and not an adversary. The second limitation of access is crucial for medical data gener- • Deep Learning → Generative Adversarial Networks; • Ap- ation. Access to existing medical databases [10, 18] is hard due to plication → Healthcare; • Learning → Knowledge Intensive Learn- cost and access concerns and thus synthetic data generation holds ing. tremendous promise [6, 13, 19, 35, 48]. While previous methods KEYWORDS generated synthetic images, we go beyond images and generate clin- generative adversarial networks, human in the loop, healthcare ical data. Building on this body of work, we present a synthetic data ACM Reference Format: generation framework that effectively exploits domain expertise to Devendra Singh Dhami, Mayukh Das, and Sriraam Natarajan. 2020. Knowl- handle data quality. edge Intensive Learning of Generative Adversarial Networks. In Proceedings We make a few key contributions: of KDD Workshop on Knowledge-infused Mining and Learning (KiML’20). , 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn (1) We demonstrate how effective human advice can be provided to a GAN as an inductive bias. 1 INTRODUCTION (2) We present a method for generating data given this advice. (3) Finally, we demonstrate the effectiveness and efficacy of our Deep learning models have reshaped the machine learning landscape approach on 2 de-identified clinical data sets. Our method over the past decade [16, 29]. Specifically, Generative Adversar- is generalizable to multiple modalities of data and is not ial Networks (GANs) [17] have found tremendous success in gen- necessarily restricted to images. 
erating examples for images [34, 37, 45], photographs of human (4) Yet another feature of this approach is that training occurs faces [1, 25, 52], image to image translation [30, 33, 55] and 3D from very few data samples (< 50 in one domain) thus pro- object generation [44, 51, 53] to name a few. Despite such success, viding human guidance as a data generation alternative. there are several key factors that limit the widespread adoption of GANs, for a broader range of tasks, including, widely acknowledged data hungry nature of such methods, potential access issues of real 2 RELATED WORK medical data and finally, their restricted usage, mainly in the con- The key principle behind GANs [17] is a zero-sum game [26] from text of images. These factors have limited the use of these arguably game theory, a mathematical representation where each participant’s successful techniques in medical (or similar) domains. However, gain or loss is exactly balanced by the losses or gains of the other recently, synthetic data generation has become a centerpiece of re- participants and is generally solved by a minimax algorithm. The search in medical AI due to the diverse difficulties in collection, generator distribution 𝑝𝑑𝑎𝑡𝑎 (𝒙) over the given data 𝒙 is learned by persistence, sharing and analysis of real clinical data. sampling 𝒛 from a random distribution 𝑝 𝒛 (𝒛) (initially uniform was proposed but Gaussians have been proven superior [2]). While GANs In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, have proven to be a powerful framework for estimating generative California, USA, August 24, 2020. Use permitted under Creative Commons License distributions, convergence dynamics of naive mini-max algorithm Attribution 4.0 International (CC BY 4.0). has been shown to be unstable. 
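The zero-sum setup just described can be made concrete with a small numerical sketch. The following is our own toy illustration (not the authors' code, and not the HA-GAN model introduced later): a one-parameter "generator" g(z) = θ + z tries to match real data drawn from N(4, 1), while a logistic "discriminator" d(x) = σ(wx + b) ascends the value function V(D, G) = E[log d(x)] + E[log(1 − d(g(z)))] that the generator descends. All names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def value(w, b, theta, x_real, z):
    """V(D, G) = E[log d(x)] + E[log(1 - d(g(z)))]; D ascends, G descends."""
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * (theta + z) + b)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def d_step(w, b, theta, x_real, z, lr=0.01):
    """One gradient-ascent step on V for the discriminator parameters."""
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * (theta + z) + b)
    grad_w = np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * (theta + z))
    grad_b = np.mean(1.0 - d_real) - np.mean(d_fake)
    return w + lr * grad_w, b + lr * grad_b

def g_step(w, b, theta, z, lr=0.01):
    """One gradient-descent step on V for the generator parameter theta."""
    d_fake = sigmoid(w * (theta + z) + b)
    grad_theta = -w * np.mean(d_fake)  # dV/dtheta
    return theta - lr * grad_theta

# Alternating minimax updates on fixed minibatches.
w, b, theta = 0.0, 0.0, 0.0
x_real = rng.normal(4.0, 1.0, size=256)   # real data ~ N(4, 1)
z = rng.normal(0.0, 1.0, size=256)        # generator noise

v0 = value(w, b, theta, x_real, z)
w, b = d_step(w, b, theta, x_real, z)     # D's step pushes V up
v1 = value(w, b, theta, x_real, z)
theta = g_step(w, b, theta, z)            # G's step pushes V back down
```

Even in this toy setting, the alternating updates display the sensitivity the section discusses: whether the minimax dynamics converge depends heavily on learning rates and on the data, which is what motivates both the distribution-level remedies cited next and, in this paper, the injection of human advice.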
Some recent approaches, among many others, augment learning either via statistical relationships between the true and learned generative distributions, such as the Wasserstein-1 distance [3] or MMD [32], or via spectral normalization of the parameter space of the generator [39], which keeps the generator distribution from drifting too far. Although these approaches have improved GAN learning in some cases, there is room for improvement.

Guidance via human knowledge is a provably effective way to control learning in the presence of systematic noise (which leads to instability). One typical strategy to incorporate such guidance is by providing rules over training examples and features. Some of the earliest approaches are explanation-based learning (EBL-NN, [49]) or ANNs augmented with symbolic rules (KBANN, [50]). Various widely studied techniques for leveraging domain knowledge for optimal model generalization include polyhedral constraints in the case of knowledge-based SVMs [9, 14, 28, 47], preference rules [5, 27, 41, 42] or qualitative constraints (e.g., monotonicities/synergies [54] or quantitative relationships [15]). Notably, whereas these models exhibit considerable improvement with the incorporation of human knowledge, there is only limited use of such knowledge in training GANs. Our approach resembles the qualitative constraints framework in spirit.

While widely successful in building optimally generalized models in the presence of systematic noise (or sample biases), knowledge-based approaches have mostly been explored in the context of discriminative modeling. In the generative setting, a recent work extends the principle of posterior regularization from Bayesian modeling to deep generative models in order to incorporate structured domain knowledge [22]. Traditionally, knowledge-based generative learning has been studied as part of learning probabilistic graphical models with structure/parameter priors [36]. We aim to extend the use of knowledge to the generative model setting.

3 KNOWLEDGE INTENSIVE LEARNING OF GENERATIVE ADVERSARIAL NETWORKS
A notable disadvantage of the adversarial training formulation is that training is slow and unstable, leading to mode collapse [2], where the generator starts generating data of only a single modality. This has resulted in GANs not being exploited to their full potential in generating synthetic non-image clinical data. Human advice can encourage exploration of diverse areas of the feature space and helps learn more stable models [43]. Hence, we propose a human-allied GAN architecture (HA-GAN) (Figure 1). The architecture incorporates human advice in the form of feature correlations. Such intrinsic relationships between the features are crucial in medical data sets and thus become a natural candidate as additional knowledge/advice in guided model learning for faithful data generation.

Figure 1: Human-Allied GAN. Correlation advice takes the generated distribution closer to the real distribution.

Our approach builds upon a GAN architecture [17] where a random noise vector is provided to the generator, which tries to generate examples as close to the real distribution as possible. The discriminator tries to distinguish between real examples and ones generated by the generator. The generator tries to maximize the probability that the discriminator makes a mistake, and the discriminator tries to minimize its mistakes, resulting in a min-max optimization problem which can be solved by a minimax algorithm. We adopt the Wasserstein GAN (WGAN) architecture [3, 20] (we use 'GAN' to indicate 'W-GAN' in this case), which focuses on defining a distance/divergence (the Wasserstein or earth mover's distance) to measure the closeness between the real distribution and the model distribution.

3.1 Human input as inductive bias
Historically, two approaches have been studied for using guidance as bias. The first is to provide advice on the labels as constraints or preferences that control the search space. Some example advice rules on the labels include: (3 ≤ feature1 ≤ 5) ⇒ label = 1 and (0.6 ≤ feature2 ≤ 0.8) ∧ (4 ≤ feature3 ≤ 5) ⇒ label = 0. Such advice is more relevant in a discriminative setting but is not ideal for GANs: since GANs are shown to be sensitive to the training data, and here the labels are themselves being generated, they should not be altered during training. The second is via correlations between features as preferences (our approach), which allows for a faithful representation of diverse modality.

Advice injection: After every fixed number of iterations N, we calculate the correlation matrix of the generated data G1 and provide a set of advice ψ on the correlations between different features. Consider the following motivating example for the use of correlations as a form of advice.

Example: Consider predicting heart attack with 3 features: cholesterol, blood pressure (BP) and income. The values of the given features can vary (sometimes widely) between different patients due to several latent factors (e.g., smoking habits). It is difficult to assume any specific distribution. In other words, it is difficult to deduce whether the values for the features come from the same distribution (even though the feature values in the data set are similar).

We modify the correlation coefficients (for both positive and negative correlations) between the features, increasing them if the human advice suggests that two features are highly correlated and decreasing them if the advice suggests otherwise.

Example: Continuing the above example, since a rise in the cholesterol level can lead to a rise in BP and vice versa, expert advice here can suggest that cholesterol and BP should be highly correlated. Also, as income may not contribute directly to BP and cholesterol levels, another piece of advice can be to de-correlate cholesterol/BP and income level.

The example advice rules ∈ ψ are: 1. Correlation("cholesterol level", "BP") ↑, 2. Correlation("cholesterol level", "income level") ↓ and 3. Correlation("BP", "income level") ↓, where ↑ and ↓ indicate increase and decrease respectively. Based on the 1st advice we need to increase the correlation coefficient between cholesterol level and BP. Then

    C = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]],   A = [[1, λ, 1], [λ, 1, 1], [1, 1, 1]]    (1)

Here C is the correlation matrix, A is the advice matrix, and λ is the factor by which the correlation value is to be augmented. In the case where we need to increase the value of the correlation coefficient, λ should be > 1. We keep λ = 1/max(|C|), the maximum being taken over the off-diagonal entries (here λ = 1/0.3). Since −1.0 ≤ c ≤ 1.0 for all c ∈ C, the value of λ ≥ 1.0, leading to enhanced correlation via the Hadamard product. Thus the new correlation matrix Ĉ is

    Ĉ = C ⊙ A = C ⊙ [[1, 1/0.3, 1], [1/0.3, 1, 1], [1, 1, 1]] = [[1, 0.667, 0.3], [0.667, 1, 0.07], [0.3, 0.07, 1]]    (2)

If the advice says that features have low correlations (the 2nd and 3rd rules in the example), we decrease the correlation coefficient. Now λ must be < 1 and we set λ = max(|C|). Since −1 ≤ c ≤ 1.0, the value of λ ≤ 1.0, and multiplying by λ will decrease the correlation value; the new correlation matrix is

    Ĉ1 = Ĉ ⊙ A = [[1, 0.667, 0.3], [0.667, 1, 0.07], [0.3, 0.07, 1]] ⊙ [[1, 1, 0.3], [1, 1, 0.3], [0.3, 0.3, 1]] = [[1, 0.667, 0.09], [0.667, 1, 0.021], [0.09, 0.021, 1]]    (3)

This is used to create the newly generated data G̃1. For negative correlations, the process is unchanged.

3.2 Advice-guided data generation
After Ĉ1 is constructed, we next generate data satisfying the constraints. To this effect, we employ the Iman-Conover method [23], a distribution-free method to define dependencies between distributional variables based on rank correlations such as Spearman or Kendall tau correlations. Since we deal with linear relationships between the features and assume a normal distribution, and since the Pearson coefficient has been shown to perform equally well with the Iman-Conover method [40] due to the close relationship between Pearson and Spearman correlations, we use Pearson correlations. Further, we assume that the features are Gaussian, justified by the fact that most lab test data is continuous. The Iman-Conover method consists of the following steps:

[Step 1]: Create a random standardized matrix M with values x ∈ M drawn from a Gaussian distribution. This is obtained by the process of inverse transform sampling described next. Let V be a uniformly distributed random variable and CDF be the cumulative distribution function. For a sampled point v, CDF(v) = P(V ≤ v). Thus, to generate samples, the values v ∼ V are passed through CDF⁻¹ to obtain the desired values x [CDF⁻¹(v) = {x | CDF(x) ≤ v, v ∈ [0, 1]}]. Thus, for a Gaussian,

    CDF(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−x²/2) dx = ∫_{0}^{x} (1/√(2π)) exp(−x²/2) dx = [−exp(−x²/2)]₀ˣ    (4)

The inverse CDF can thus be written as CDF⁻¹(v): 1 − exp(−x²/2) ≤ v, and the desired values x ∈ M can be obtained as x = √(−2 ln(1 − v)).

[Step 2]: Calculate the correlation matrix E of M.

[Step 3]: Calculate the Cholesky decomposition F of the correlation matrix E. The Cholesky decomposition [46] of a positive-definite matrix is given as the product of a lower triangular matrix and its conjugate transpose. Note that for the Cholesky decomposition to be unique, the target matrix should be positive definite (such as the covariance matrix), whereas the correlation matrix used in our algorithm is only positive semi-definite. We enforce positive-definiteness by repeatedly adding very small values to the diagonal of the correlation matrix until positive-definiteness is ensured. Given a symmetric and positive definite matrix E, its Cholesky decomposition F is such that E = F · Fᵀ.

[Step 4]: Calculate the Cholesky decomposition Q of the correlation matrix obtained after the modifications based on human advice, Ĉ. As above, the Cholesky decomposition is such that Ĉ = Q · Qᵀ.

[Step 5]: Calculate the reference matrix T by transforming the sampled matrix M from Step 1 to have the desired correlations of Ĉ, using their Cholesky decompositions.

[Step 6]: Rearrange the values in the columns of the generated data G1 to have the same ordering as the corresponding columns of the reference matrix T, to obtain the final generated data G̃1.

Cholesky decomposition to model correlations: Given a randomly generated data set with no correlations P, a correlation matrix C and its Cholesky decomposition Q, data that faithfully follows the given correlations in C can be generated as the product of the obtained lower triangular matrix with the original uncorrelated data, i.e., P̂ = QP. The correlation of the newly obtained data P̂ is

    Corr(P̂) = Cov(P̂) / (σ_P̂ σ_P̂) = (E[P̂P̂ᵀ] − E[P̂] E[P̂]ᵀ) / (σ_P̂ σ_P̂)    (5)

Since we consider data P̂ from a Gaussian distribution with zero mean and unit variance,

    Corr(P̂) = E[P̂P̂ᵀ] = E[(QP)(QP)ᵀ] = E[QPPᵀQᵀ] = Q E[PPᵀ] Qᵀ = QQᵀ = C    (6)

Thus the Cholesky decomposition can capture the desired correlations faithfully and can be used for generating correlated data. Since we already have a normal sampled matrix M and a calculated correlation matrix E of M, we need to calculate a reference matrix (Step 5).

3.3 Human-Allied GAN training
Since the human expert advice is provided independently of the GAN architecture, our method is agnostic of the underlying GAN architecture. We make use of the Wasserstein GAN (WGAN) architecture since it has been shown to be more stable during training and can handle mode collapse [3]. Only the error backpropagation values differ when we are using the data generated by the underlying GAN versus the data generated by the Iman-Conover method. Our algorithm starts with the general process of training a GAN, where the generator takes random noise as input and generates data, which is then passed, along with the real data, to the discriminator. The discriminator tries to identify the real and generated data, and the error is backpropagated to the generator. After every specified number of iterations, the correlations between features C in the generated data are obtained and a new correlation matrix Ĉ is computed with respect to the expert advice (Section 3.1). A new data set is generated wrt Ĉ using the Iman-Conover method (Section 3.2) and then passed to the discriminator along with the real data set. We train the GAN for 10K epochs and provide correlation advice every 1K iterations.

4 EXPERIMENTAL EVALUATION
We aim to answer the following questions:
Q1: Does providing advice to GANs help in generating better quality data?
Q2: Are GANs with advice effective for data sets that have few examples?
Q3: How does bad advice affect the quality of generated data?
Q4: How well does human advice handle class imbalance?
Q5: How does our method compare to state-of-the-art GAN architectures?

We consider 2 real clinical data sets.
(1) Nephrotic Syndrome is a novel data set of symptoms that indicate kidney damage. It consists of 50 kidney biopsy images along with the clinical reports, sourced from Dr Lal PathLabs, India². We use the clinical reports, which consist of the values for kidney tissue diagnosis; these can confirm the clinical diagnosis, help to identify high-risk patients, influence treatment decisions and help medical practitioners to plan and prognosticate treatments. The data consists of 19 features with 44 positive and 6 negative examples.
(2) The MIMIC database [24] consists of de-identified information on patients admitted to critical care units at a large tertiary care hospital. The features included are predominantly time-window aggregations of physiological measurements from the medical records. We selected relevant lab results, vital sign observations and feature aggregations. The data consists of 18 features with 5813 positive and 40707 negative examples.

² https://www.lalpathlabs.com/

Advice Acquisition: Here we compile the sources from which we obtain the advice.
(1) Nephrotic Syndrome: This is a novel real data set, and the advice is obtained from a nephrologist in India. According to the problem statement from the expert, nephrotic syndrome involves the loss of a lot of protein, and nephritic syndrome involves the loss of a lot of blood through urine. A kidney biopsy is often required to diagnose the underlying pathology in patients with suspected glomerular disease. The goal of the project is to build a clinical support system that predicts the disease using clinical features, thus reducing the need for kidney biopsy. Since the data collection is scarce, a synthetic data set can help in better understanding of the disease from the clinical features.
(2) MIMIC: The feature set and the expected correlations are obtained in consultation with trauma experts at a Dallas hospital.

All experiments were run on a 64-bit Intel(R) Xeon(R) CPU E5-2630 v3 server for 10K epochs. Both the generator and discriminator are neural networks with 4 hidden layers. To measure the quality of the generated data we make use of the train-on-synthetic, test-on-real (TSTR) method proposed in [12]. We use gradient boosting with 100 estimators and a learning rate of 0.01 as the underlying model.

Table 1 shows the results of the TSTR method with data generated with advice (HA-GAN_GA) and without advice (GAN). It shows that the data generated with advice has higher TSTR performance than the data generated without advice across all data sets and all metrics. Thus, to answer Q1: providing advice to generative adversarial networks captures the relationships between features better and thus enables the generation of better quality synthetic data.

Table 1: TSTR results (≈ 3 dec.). N/A in Nephrotic Syndrome denotes that all generated labels were of a single class (0 in our case) and thus we were not able to run the discriminative algorithm in the TSTR method. GA and BA denote good and bad advice to our HA-GAN model respectively.

Data set | Method     | Recall | F1    | AUC-ROC | AUC-PR
NS       | GAN        | 0.584  | 0.666 | 0.509   | 0.911
NS       | HA-GAN_BA  | 0.42   | 0.511 | 0.518   | 0.886
NS       | medGAN     | N/A    | N/A   | N/A     | N/A
NS       | medWGAN    | N/A    | N/A   | N/A     | N/A
NS       | medBGAN    | N/A    | N/A   | N/A     | N/A
NS       | HA-GAN_GA  | 1.0    | 0.943 | 0.566   | 0.947
MIMIC    | GAN        | 0.122  | 0.119 | 0.495   | 0.174
MIMIC    | HA-GAN_BA  | 0.285  | 0.143 | 0.459   | 0.235
MIMIC    | medGAN     | 0.374  | 0.163 | 0.478   | 0.279
MIMIC    | medWGAN    | 0.0    | 0.0   | 0.5     | 0.562
MIMIC    | medBGAN    | 0.0    | 0.0   | 0.5     | 0.562
MIMIC    | HA-GAN_GA  | 0.979  | 0.263 | 0.598   | 0.567

Learning with less data: GANs with advice are especially impressive on the nephrotic syndrome data, which consists of only 50 examples and is thus very small when compared to the number of samples typically required to train a GAN model; our method is better across all metrics. Thus, we realize an important property of incorporating human guidance in the GAN model and can answer Q2 affirmatively. The use of advice opens up the potential of using GANs in the presence of sparse data samples.

Effect of bad advice: Table 1 also shows the results for data generated with bad advice (HA-GAN_BA). To simulate bad advice, we follow a simple process: if the advice says that the correlation between features should be high, we set the corresponding correlations in Ĉ to 0, and if the advice says that the correlation should be low, we set the correlations in Ĉ to either 1 or −1, based on whether the original correlation is positive or negative. Thus, given a correlation matrix

    C = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]]    (7)

suppose the advice says that we need to increase the correlation coefficient between feature 1 and feature 2. Then the new correlation matrix after bad advice can be calculated as:

    C = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]],   A = [[1, λ, 1], [λ, 1, 1], [1, 1, 1]]    (8)

    Ĉ = C ⊙ A    (9)

where λ is the factor by which the correlation value is to be augmented. Since the advice asks to increase the correlation, we set λ = 0. Thus,

    Ĉ = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]] ⊙ [[1, 0, 1], [0, 1, 1], [1, 1, 1]] = [[1, 0.0, 0.3], [0.0, 1, 0.07], [0.3, 0.07, 1]]    (10)

Similarly, if the advice says that we need to decrease the correlation coefficient between feature 1 and feature 3, we set λ = 1/(feat. val), the reciprocal of the current correlation value (here 1/0.3):

    Ĉ = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]] ⊙ [[1, 1, 1/0.3], [1, 1, 1], [1/0.3, 1, 1]] = [[1, 0.2, 1.0], [0.2, 1, 0.07], [1.0, 0.07, 1]]    (11)

As the results in Table 1 show, giving bad advice adversely affects the performance, thereby answering Q3.

The nephrotic syndrome and MIMIC data sets are relatively unbalanced, with pos-to-neg ratios of ≈ 8:1 and 1:7 respectively. Most medical data sets, except highly curated ones, are unbalanced, and a data generator model should be able to handle this imbalance. Since our method explicitly focuses on the correlations between features and generates better quality data based on such relationships, it is quite robust to imbalance in the underlying data. This can be seen in the results in Table 1, where advice-based data generation outperforms the non-advice and bad-advice based data generation. Thus, we can answer Q4 affirmatively.

To answer Q5, we compare our method to 3 GAN architectures: medGAN [8], which uses an encoder-decoder framework for EHR data generation, and its 2 variants medBGAN and medWGAN [4]; the results are shown in Table 1. Our method, with good advice, outperforms the baselines in both domains, showing the effectiveness of our method.

5 CONCLUSION
We presented a new GAN formulation that employs correlation information between features as advice to generate new correlated data and train the underlying GAN model. We tested our model on real clinical data sets and showed that incorporating advice helps generate good quality synthetic medical data. We employed the TSTR method to test the quality of the generated data and demonstrated that the data generated with advice is more aligned with the real data.

There are several interesting future directions. First, providing advice only when required, in an active fashion, can allow for a significant reduction in the amount of effort on the human side. Second, there can be multiple advice options, such as posterior regularization [15], that can be used to capture feature relationships explicitly. Third, although we do not have identifiers in the data, thereby eliminating the need for differential privacy [11], a general framework that can uphold the privacy of patient data along the lines of using Cholesky decomposition [7, 31] is a natural next step.

ACKNOWLEDGMENTS
DSD and SN gratefully acknowledge DARPA Minerva award FA9550-19-1-0391. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

REFERENCES
[1] Grigory Antipov, Moez Baccouche, and Jean-Luc Dugelay. 2017. Face aging with conditional generative adversarial networks. In ICIP.
[2] Martin Arjovsky and Leon Bottou. 2017. Towards principled methods for training generative adversarial networks. In ICLR.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. ICML (2017).
[4] Mrinal Kanti Baowaly, Chia-Ching Lin, Chao-Lin Liu, and Kuan-Ta Chen. 2019. Synthesizing electronic health records using improved generative adversarial networks. JAMIA (2019).
[5] Darius Braziunas and Craig Boutilier. 2006. Preference elicitation and generalized additive utility. In AAAI.
[6] Anna L Buczak, Steven Babin, and Linda Moniz. 2010. Data-driven approach for creating synthetic electronic medical records. BMC Medical Informatics and Decision Making (2010).
[7] Jim Burridge. 2003. Information preserving statistical obfuscation. Statistics and Computing (2003).
[8] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F Stewart, and Jimeng Sun. 2017. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. In MLHC.
[9] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning (1995).
[10] Ivo D Dinov. 2016. Volume and value of big healthcare data. Journal of Medical Statistics and Informatics (2016).
[11] Cynthia Dwork. 2008. Differential privacy: A survey of results. In TAMC.
[12] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. 2017. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633 (2017).
[13] Maayan Frid-Adar, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. 2018. Synthetic data augmentation using GAN for improved liver lesion classification. In ISBI.
[14] Glenn M Fung, Olvi L Mangasarian, and Jude W Shavlik. 2003. Knowledge-based support vector machine classifiers. In NIPS.
[15] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. 2010. Posterior regularization for structured latent variable models. JMLR (2010).
[36] V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, and T. L. Griffiths. 2006. Structured Priors for Structure Learning. In UAI.
[37] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In ICCV.
[38] Tom M Mitchell. 1980. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey.
[39] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normalization for generative adversarial networks. ICLR (2018).
[40] Klemen Naveršnik and Klemen Rojnik. 2012. Handling input correlations in pharmacoeconomic models. Value in Health (2012).
[41] P. Odom, T. Khot, R. Porter, and S. Natarajan. 2015. Knowledge-Based Probabilistic Logic Learning. In AAAI.
[42] Phillip Odom and Sriraam Natarajan. 2015. Active advice seeking for inverse reinforcement learning. In AAAI.
[43] Phillip Odom and Sriraam Natarajan. 2018. Human-guided learning for probabilistic logic models. Frontiers in Robotics and AI (2018).
[44] Michela Paganini, Luke de Oliveira, and Benjamin Nachman. 2018. CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Physical Review D (2018).
[45] Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR (2016).
[46] Ernest M Scheuer and David S Stoller. 1962. On the generation of normal random vectors. Technometrics (1962).
[47] Bernhard Schölkopf, Patrice Simard, Alex J Smola, and Vladimir Vapnik. 1998. Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems. 640–646.
[48] Rittika Shamsuddin, Barbara M Maweu, Ming Li, and Balakrishnan Prabhakaran. 2018.
Virtual patient model: an approach for generating synthetic healthcare time series data. In ICHI.
[49] Jude W Shavlik and Geoffrey G Towell. 1989. Combining explanation-based learning and artificial neural networks. In Proceedings of the Sixth International Workshop on Machine Learning. Elsevier.
[50] Geoffrey G Towell and Jude W Shavlik. 1994. Knowledge-based artificial neural networks. Artificial Intelligence (1994).
[51] Yan Wang, Biting Yu, Lei Wang, Chen Zu, David S Lalush, Weili Lin, Xi Wu, Jiliu Zhou, Dinggang Shen, and Luping Zhou. 2018. 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. NeuroImage (2018).
[52] Zongwei Wang, Xu Tang, Weixin Luo, and Shenghua Gao. 2018. Face aging with identity-preserved conditional generative adversarial networks. In CVPR.
[53] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS.
[54] S. Yang and S. Natarajan. 2013. Knowledge Intensive Learning: Combining Qualitative Constraints with Causal Independence for Parameter Learning in Probabilistic Models. In ECMLPKDD.
[55] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
[18] Peter Groves, Basel Kayyali, David Knott, and Steve Van Kuiken. 2016. The 'big data' revolution in healthcare: Accelerating value and innovation. (2016).
[19] John T Guibas, Tejpal S Virdi, and Peter S Li. 2017. Synthetic medical images from dual generative adversarial networks. arXiv preprint arXiv:1709.01872 (2017).
[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of Wasserstein GANs. In NIPS.
[21] Haroun Habeeb, Ankit Anand, Mausam Mausam, and Parag Singla. 2017. Coarse-to-fine lifted MAP inference in computer vision. In IJCAI.
[22] Zhiting Hu, Zichao Yang, Russ R Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric P Xing. 2018. Deep Generative Models with Learnable Knowledge Constraints. In NeurIPS.
[23] Ronald L Iman and William-Jay Conover. 1982. A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics - Simulation and Computation (1982).
[24] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data (2016).
[25] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR.
[26] Harold William Kuhn and Albert William Tucker. 1953. Contributions to the Theory of Games.
[27] Gautam Kunapuli, Phillip Odom, Jude W Shavlik, and Sriraam Natarajan. 2013. Guiding autonomous agents to better behaviors through human advice. In ICDM.
[28] Quoc V Le, Alex J Smola, and Thomas Gärtner. 2006. Simpler knowledge-based support vector machines. In ICML.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature (2015).
[30] Minjun Li, Haozhi Huang, Lin Ma, Wei Liu, Tong Zhang, and Yugang Jiang. 2018. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. In ECCV.
[31] Yaping Li, Minghua Chen, Qiwei Li, and Wei Zhang. 2011. Enabling multilevel trust in privacy preserving data mining. TKDE (2011).
[32] Yujia Li, Kevin Swersky, and Rich Zemel. 2015. Generative moment matching networks. In ICML.
[33] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017.
Unsupervised image-to-image translation networks. In NIPS.
[34] Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In NIPS.
[35] Faisal Mahmood, Richard Chen, and Nicholas J Durr. 2018. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging (2018).

Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak

Amanuel Alambo (Knoesis Center, Dayton, Ohio; amanuel@knoesis.org)
Manas Gaur (AI Institute, University of South Carolina, Columbia, South Carolina; mgaur@email.sc.edu)
Krishnaprasad Thirunarayan (Knoesis Center, Dayton, Ohio; tkprasad@knoesis.org)

ABSTRACT

The COVID-19 pandemic is having a serious adverse impact on the lives of people across the world. COVID-19 has exacerbated community-wide depression and has led to increased drug abuse brought about by the isolation of individuals as a result of lockdown. Further, apart from providing informative content to the public, the incessant media coverage of the COVID-19 crisis in terms of news broadcasts, published articles, and sharing of information on social media has had an undesired snowballing effect on stress levels (further elevating depression and drug use) due to an uncertain future. In this position paper, we propose a novel framework for assessing the spatio-temporal-thematic progression of depression, drug abuse, and informativeness of the underlying news content across the different states in the United States. Our framework employs an attention-based transfer learning technique to apply knowledge learned on a social media domain to a target domain of media exposure. To extract news articles that are related to COVID-19 communications from the streaming news content on the web, we use neural semantic parsing and background knowledge bases in a sequence of steps called semantic filtering. We achieve promising preliminary results on three variations of the Bidirectional Encoder Representations from Transformers (BERT) model. We compare our findings against a report from Mental Health America, and the results show that our fine-tuned BERT models perform better than vanilla BERT. Our study can benefit epidemiologists by offering actionable insights on COVID-19 and its regional impact. Further, our solution can be integrated into end-user applications to tailor news for users based on their emotional tone, measured on the scale of depressiveness, drug abusiveness, and informativeness.

KEYWORDS

COVID-19; Spatio-Temporal-Thematic; Depressiveness; Drug Abuse; Informativeness; Transfer Learning

ACM Reference Format: Amanuel Alambo, Manas Gaur, and Krishnaprasad Thirunarayan. 2020. Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

The COVID-19 pandemic has changed our societal dynamics in different ways due to the varying impact of news articles and broadcasts on a diverse population in society. Thus, it is important to place the news articles in their spatio-temporal-thematic (Nagarajan et al., 2009; Andrienko et al., 2013; Harbelot et al., 2015) contexts to offer appropriate and timely response and intervention. In order to limit the scope of this research agenda, we propose to focus on identifying regions that are exposed to depressive and drug abusive news articles and to determine/recommend ways for timely interventions by epidemiologists.

The impact of COVID-19 on mental health has been investigated in recent studies (Garfin et al., 2020; Holmes et al., 2020; Qiu et al., 2020). [4] studied the impact of repeated media exposure on the mental well-being of individuals and its ripple effects. [8] underscore the importance of a multidisciplinary study to better understand COVID-19; specifically, the study explores its psychological, social, and neuroscientific impacts. [12] studied the psychological impact the COVID-19 lockdown had on the Chinese population. These studies, however, do not adequately explore a technique to computationally analyze the regional repercussions associated with media exposure to COVID-19 that may provide a better basis for local grassroots-level action.

We propose an approach to measure depressiveness, drug abusiveness, and informativeness as a result of media exposure for various states in the US in the months from January 2020 to March 2020. Our study is focused on the first quarter of 2020, as this period was critical in the spread of COVID-19 and its ominous impact; this was a period when the public faced major changes to lifestyle including lockdown, social distancing, closure of businesses, unemployment, and, broadly speaking, a complete lack of control over the unfolding situation, precipitating severe uncertainty about the impending future. In consequence, this continued media exposure progressively worsened the mental health of individuals across the board. We analyze and score news content on three orthogonal dimensions: spatial, temporal, and thematic. For spatial, we use state boundaries. For temporal, we use monthly data analysis. For thematic, we score news content on the category/dimension of depression, drug abuse, and informativeness (relevant to COVID-19 but not directly connected to either depression or drug abuse).

Figure 1: Spatio-Temporal-Thematic Dimensions

Our study hinges on the use of domain-specific language modeling and transfer learning to better understand how depressiveness, drug abusiveness, and informativeness of news articles evolve in response to media exposure by people. We conduct the transfer of knowledge learned on a social media platform to the domain of exposure to news using variations of the attention-based BERT model (Devlin et al., 2018), also called Vanilla BERT. Thus, in addition to vanilla BERT, we fine-tune BERT models on corpora that are representative of depression and drug abuse. Then, we compare results obtained using the three variants of the BERT model. For scoring depressiveness, drug abusiveness, and informativeness of news articles, we utilize entities from structured domain knowledge from the Patient Health Questionnaire (PHQ-9) lexicon (Yazdavar et al., 2017), Drug Abuse Ontology (DAO) (Cameron et al., 2013), and DBpedia (Lehmann et al., 2015). The PHQ-9 lexicon is a knowledge base developed specifically for assessing depression, and DAO is built to study drug abuse. Similarly, we use DBpedia, which is a generic and comprehensive knowledge base, for assessing the informativeness of news content.

Having determined the scores for depressiveness, drug abusiveness, and informativeness of news articles for each state during the three months, we computed the aggregate score for each thematic category by summing up the scores for the news articles. We finally assigned the category with the highest score as a label for a state. For instance, if the aggregate score of depressiveness for the state of Iowa in the month of January 2020 is the highest of the three thematic categories, then the state of Iowa is assigned a label of depression for that month, which means the state of Iowa is most exposed to depressive news content. Thus, identifying which states are consistently exposed to depressive or drug abusive news content enables policy makers and epidemiologists to devise appropriate intervention strategies.

2 DATA COLLECTION

We collected 1.2 Million news articles from the Web and GDELT¹ (a resource that stores world news on significant events from different countries) using semantic filtering (Sheth and Kapanipathi, 2016), spanning the period from January 01, 2020, to March 29, 2020. We filtered news articles that did not originate from within the US and grouped the ones that are from the US based on their state of origination. The state-level grouped news articles had a total of over 150K entities identified using the DBpedia spotlight service². However, since using a coarse filtering service such as DBpedia spotlight over the entire news articles is not efficient and brings in irrelevant entities, and thus noisy news articles, we utilize ("i") a neural parsing approach with self-attention (Wu et al., 2019) to extract relevant entities. After extracting relevant entities and news articles, we use ("ii") the DBpedia spotlight service to identify news articles that are related to online communications about COVID-19.

Figure 2: Knowledge-based entity extraction using Semantic Filtering

For this task, we explored 780 DBpedia categories that are relevant to COVID-19 communications to create the most relevant set of entities and news articles. Further, upon inspection of the news articles, we discovered medical terms that were not available in DBpedia. As a result, we used ("iii") the MeSH terms hierarchy in the Unified Medical Language System (UMLS), the Diagnostic and Statistical Manual for Mental Disorders (DSM-5) lexicon (Gaur et al., 2018), and the Drug Abuse Ontology (DAO), collectively referred to as the Mental Health and Drug Abuse Knowledgebase (MHDA-Kb), to spot additional entities. Thus, from 700K unique news articles (which were extracted from the total of 1.2 Million news articles by removing duplicates), we created a set of 120K unique entities that are described by the 780 DBpedia categories and 225 concepts in MHDA-Kb. The figures below show two examples that illustrate entities spotted during entity extraction on a sample news article. A news article that has entities identified using this sequence of steps is selected for our study.

Figure 3: Example entity extraction-I using Semantic Filtering
Figure 4: Example entity extraction-II using Semantic Filtering

¹ https://www.gdeltproject.org/
² https://www.dbpedia-spotlight.org/

3 METHODS

We propose to use three variations of the BERT model for representing news articles. In its basic form, we use vanilla BERT for encoding news articles. For the remaining two variations, we fine-tune BERT on a binary sequence classification task by independently training on two corpora using masked language modeling (MLM) and next sentence prediction (NSP) objectives. The two corpora used are: 1) Subreddit Depression (Gkotsis et al., 2017; Gaur et al., 2018); 2) a combination of subreddits: Crippling Alcoholism, Opiates, Opiates Recovery, and Addiction (abbreviated COOA), each consisting of Reddit posts about drug abuse. Subreddit Depression has 760049 posts across 121795 Redditors, and COOA has 1416765 posts from 46183 users, both consisting of posts from the years 2005 - 2016. Reddit posts belonging to the subreddits Depression or COOA are considered positive classes, and the 380444 posts from a control group (~10K subreddits unrelated to mental health) serve as negative classes. We use the following settings for training our BERT model for sequence classification: training batch size of 16, maximum sequence length of 256, Adam optimizer with learning rate of 2e-5, number of training epochs set to 10, and a warmup proportion of 0.1. We used a 40%-60% split for training and testing sets for creating the BERT models and achieved a test accuracy of 89% for Depression-BERT and 78% for Drug Abuse-BERT. We set the size of the training set smaller than the testing set for generalizability of our models. In this manuscript, we refer to the BERT model fine-tuned on subreddit Depression as Depression-BERT or DPR-BERT, while the one fine-tuned on subreddit COOA is Drug Abuse-BERT or DA-BERT.

In addition to using BERT for encoding news content, we also use it for representing the entities in the background knowledge bases (i.e., PHQ-9, DAO, and DBpedia). Once we have encoded the news articles and the entities in the knowledge bases using vanilla BERT or a fine-tuned BERT model, we generated a depressiveness score, drug abusiveness score, and informativeness score corresponding to the entities in PHQ-9, DAO, and DBpedia respectively. The equation below gives the score of a news article for a category given one of the BERT models:

$$\mathit{Score}^{m}_{c}(\text{news}) = \frac{1}{|E_{KB}|} \sum_{e=1}^{|E_{KB}|} \mathit{cossim}(\text{news}, e) \qquad (1)$$

where,
m ∈ {vanilla-BERT, DPR-BERT, DA-BERT}
c ∈ {informativeness, depressiveness, drug abuse}
cossim(news, e): cosine similarity between a news content and an entity in KB
KB - a collection of entities present in PHQ-9, DBpedia, or DAO

We used the base variant of the BERT model with 12 layers, 768 hidden units, and 12 attention heads. We use PyTorch 1.5.0+cu101 for fine-tuning our BERT models. All our programs were run on Google Colab's NVIDIA Tesla P100 PCI-E GPU.

4 PRELIMINARY RESULTS AND DISCUSSION

In this section, we report the state-wise labels (i.e., depressive, drug abusive, informative) for each month obtained after summing the scores of news articles as described. The category with the highest cumulative score is set as the label for a state.

Using vanilla-BERT (Figure 5), we can see that no state shows exposure to news content on drug abuse in January. Going from February to March, we see depressive news content move from inner-most states such as Missouri, Kansas, and Colorado to border states such as California, Montana, North Dakota, and Louisiana, making way for informative news content. Further, there are fewer states exposed to drug-related news content than those exposed to depressive or informative news content in February or March. Particularly, Arizona and Virginia show consistent exposure to drug-related news content in February and March.

Figure 5: vanilla BERT modeling of Depressiveness, Drug Abuse, and Informativeness in US states.

Using Depression-BERT, as shown in Figure 6, we see that states such as Texas and Kansas are exposed to depressive news content for the months of January and February, while states such as California, Montana, Alaska, and Michigan show higher consumption of depressive news content in February and March. With regard to informativeness, we see an overall even distribution of informative news content across the nation in February and March. Further, we see a few midwest states showing relatively higher instances of news content that is informative rather than depressive in February and March. It is interesting to see a few southern states such as Oklahoma, Texas, and Arkansas transition from exposure to depressive news content in the month of February to drug-use-related news content in the month of March.

Figure 6: Depression-BERT (DPR-BERT) modeling of Depressiveness, Drug Abuse, and Informativeness in US states

Using the Drug Abuse-BERT model (Figure 7), states such as Texas and Wisconsin shift from exposure to depressive news content in January to exposure to drug-related news content in February, while states such as California and Oklahoma transition from exposure to depressive news content in February to drug-related news content in March. Further, we see the informativeness of news content sweeping from the east to the midwest, to parts of the south, and to some parts of the west from February to March.

Figure 7: Drug Abuse BERT (DA-BERT) modeling of Depressiveness, Drug Abuse, and Informativeness in US states

Our results show that a fine-tuned BERT model cleanly separates the thematic categorical scores for a state. For instance, using DA-BERT for the month of March, the drug abuse score for the state of California is much higher than the score of depressiveness or informativeness for the same state. However, with the vanilla BERT model, the three scores computed for the various states and months are marginally different. Moreover, the results using DPR-BERT or DA-BERT capture the state-level ranking of mental disorders by Mental Health America³ better than vanilla-BERT; for a few states, the fine-tuned BERT models identify more months as having media exposure to depression or drug abuse news content.

As indicated in Table 1, we report months showing predominant media exposure to either depressive or drug abuse news articles using the three variants of the BERT model. We use 10 of the 13 states recognized as showing high prevalence of mental disorders according to a report by Mental Health America on overall mental disorder ranking. The 3 states not included in this table are Washington, Wyoming, and Idaho; we did not consider these 3 states as they were not in our dataset cohort. For the Mental Health America (MHA) report, we make a practical assumption that each of the three months is either depressive or drug abusive for each state. Thus, our objective is to maximize the number of months with exposure to depressive/drug abuse news content for each of the 10 states. We can see in Table 1 that fine-tuned BERT models help identify more months as having exposure to depressive or drug abuse news content than vanilla BERT does for the 10 states. For example, using DA-BERT, five states are identified to have at least two months showing exposure to depressive/drug abuse news content, while DPR-BERT identifies six states as having been exposed to depressive/drug abuse news content for two months. On the other hand, vanilla-BERT identifies only two states with depressive/drug abuse news content for two months.

Table 1: Evaluation of base and domain-specific BERT models for MHA states over the period of three months (January, February, and March). These three months showed high dynamicity in COVID-19 spread. Each column lists the months with depression/drug abuse exposure.

MHA states with high DPR and DA | vanilla-BERT | DA-BERT | DPR-BERT
Tennessee      | Feb, Mar | Feb, Mar      | Feb, Mar
Alabama        | Feb      | Feb, Mar      | Feb
Oklahoma       | Mar      | Feb, Mar      | Feb, Mar
Kansas         | Feb      | Jan, Feb      | Jan, Feb
Montana        | Mar      | Feb           | Feb, Mar
South Carolina | Mar      | Mar           | Feb, Mar
Alaska         | Feb, Mar | Jan, Feb, Mar | Feb, Mar
Utah           | Mar      | Mar           | Mar
Oregon         | None     | Feb           | None
Nevada         | Feb      | Feb           | None

To compare models with one another and against the report by Mental Health America (MHA), we compute a Jaccard Index between each pair of models and each model against the report from MHA. The equation below computes the Jaccard similarity between the results of two models or a model's results with an MHA report:

$$J(m_1, m_2) = \sum_{i \in S} \frac{|m_1^M \cap m_2^M|}{|m_1^M \cup m_2^M|} \qquad (2)$$

where,
m1, m2 ∈ {vanilla-BERT, DPR-BERT, DA-BERT, MHA}
S - Set of States in the US (Table 1)
m1^M, m2^M: Number of depressive, drug abusive, or informative months for a state "i"

We report inter-model and model-to-MHA Jaccard similarity scores computed using equation (2) in Figure 8.

Figure 8: Inter-BERT model and BERT Model-to-MHA Jaccard Similarity Scores as a measure of closeness of a model's prediction to an extensive survey by Mental Health America (MHA).

As shown in Figure 8, DA-BERT gives the best results against the MHA report in Jaccard similarity (0.53), which means DA-BERT identifies over half of the state-to-month instances in MHA. On the other hand, vanilla-BERT has a Jaccard similarity of 0.37 with MHA, which can be interpreted as vanilla-BERT identifying a little over one-third of the state-to-month instances in MHA. The best Jaccard similarity is achieved between DPR-BERT and vanilla-BERT (0.7); thus, 70% of state-to-month mappings are shared between DPR-BERT and vanilla-BERT based on the Jaccard index. It is interesting to see that DA-BERT has the same Jaccard similarity with vanilla-BERT and DPR-BERT, subsuming the former and being subsumed by the latter in terms of depressive/drug abusive months.

5 CONCLUSION

In this paper, we model depressiveness, drug abusiveness, and informativeness of news articles to assess the dominant category characterizing each US state during each of the three months (Jan 2020 to Mar 2020). We demonstrate the power of transfer learning by fine-tuning an attention-based deep learning model on a different domain and use the domain-tuned model for gleaning the nature of media exposure. Specifically, we use background knowledge bases for measuring depressiveness, drug abusiveness, and informativeness of news articles. We found that DA-BERT identifies the most state-to-month instances as being exposed to depressive or drug abuse news content according to the report from Mental Health America. In the future, we plan to incorporate background knowledge bases in our attention-based transfer learning framework to further investigate knowledge-infused learning (Kursuncu et al., 2019).

³ https://www.mhanational.org/issues/ranking-states

REFERENCES

[1] Gennady Andrienko, Natalia Andrienko, Harald Bosch, Thomas Ertl, Georg Fuchs, Piotr Jankowski, and Dennis Thom. 2013. Thematic patterns in georeferenced tweets through space-time visual analytics. Computing in Science & Engineering 15, 3 (2013), 72–82.
[2] Delroy Cameron, Gary A Smith, Raminta Daniulaityte, Amit P Sheth, Drashti Dave, Lu Chen, Gaurish Anand, Robert Carlson, Kera Z Watkins, and Russel Falck. 2013. PREDOSE: a semantic web platform for drug abuse epidemiology using social media. Journal of Biomedical Informatics 46, 6 (2013), 985–997.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Dana Rose Garfin, Roxane Cohen Silver, and E Alison Holman. 2020. The novel coronavirus (COVID-2019) outbreak: Amplification of public health consequences by media exposure. Health Psychology (2020).
[5] Manas Gaur, Ugur Kursuncu, Amanuel Alambo, Amit Sheth, Raminta Daniulaityte, Krishnaprasad Thirunarayan, and Jyotishman Pathak. 2018. "Let Me Tell You About Your Mental Health!" Contextualized Classification of Reddit Posts to DSM-5 for Web-based Intervention. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 753–762.
[6] George Gkotsis, Anika Oellrich, Sumithra Velupillai, Maria Liakata, Tim JP Hubbard, Richard JB Dobson, and Rina Dutta. 2017. Characterisation of mental health conditions in social media using Informed Deep Learning. Scientific Reports 7 (2017), 45141.
[7] Benjamin Harbelot, Helbert Arenas, and Christophe Cruz. 2015. LC3: A spatio-temporal and semantic model for knowledge discovery from geospatial datasets. Journal of Web Semantics 35 (2015), 3–24.
[8] Emily A Holmes, Rory C O'Connor, V Hugh Perry, Irene Tracey, Simon Wessely, Louise Arseneault, Clive Ballard, Helen Christensen, Roxane Cohen Silver, Ian Everall, et al. 2020. Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. The Lancet Psychiatry (2020).
[9] Ugur Kursuncu, Manas Gaur, and Amit Sheth. 2019. Knowledge Infused Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learning. arXiv preprint arXiv:1912.00512 (2019).
[10] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
[11] Meenakshi Nagarajan, Karthik Gomadam, Amit P Sheth, Ajith Ranabahu, Raghava Mutharaju, and Ashutosh Jadhav. 2009. Spatio-temporal-thematic analysis of citizen sensor data: Challenges and experiences. In International Conference on Web Information Systems Engineering. Springer, 539–553.
[12] Jianyin Qiu, Bin Shen, Min Zhao, Zhen Wang, Bin Xie, and Yifeng Xu. 2020. A nationwide survey of psychological distress among Chinese people in the COVID-19 epidemic: implications and policy recommendations. General Psychiatry 33, 2 (2020).
[13] Amit Sheth and Pavan Kapanipathi. 2016. Semantic filtering for social data. IEEE Internet Computing 20, 4 (2016), 74–78.
[14] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural news recommendation with personalized attention. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2576–2584.
[15] Amir Hossein Yazdavar, Hussein S Al-Olimat, Monireh Ebrahimi, Goonmeet Bajaj, Tanvi Banerjee, Krishnaprasad Thirunarayan, Jyotishman Pathak, and Amit Sheth. 2017. Semi-supervised approach to monitoring clinical depressive symptoms in social media. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. 1191–1198.

Cost Aware Feature Elicitation

Srijita Das (The University of Texas at Dallas; Srijita.Das@utdallas.edu)
Rishabh Iyer (The University of Texas at Dallas; Rishabh.Iyer@utdallas.edu)
Sriraam Natarajan (The University of Texas at Dallas; Sriraam.Natarajan@utdallas.edu)

ABSTRACT

Motivated by clinical tasks where acquiring certain features such as FMRI or blood tests can be expensive, we address the problem of test-time elicitation of features. We formulate the problem of cost-aware feature elicitation as an optimization problem with a trade-off between performance and feature acquisition cost. Our experiments on three real-world medical tasks demonstrate the efficacy and effectiveness of our proposed approach in minimizing costs and maximizing performance.

CCS CONCEPTS

• Supervised learning → Budgeted learning; Feature selection; • Applications → Healthcare.

KEYWORDS

cost sensitive learning, supervised learning, classification

ACM Reference Format: Srijita Das, Rishabh Iyer, and Sriraam Natarajan. 2020. Cost Aware Feature Elicitation. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

In a supervised classification setting, every instance has a fixed feature vector, and a discriminative function is learnt on such a fixed-length feature vector and its corresponding class variable. However, a lot of practical problems like healthcare, network domains, designing survey questionnaires [19, 20], etc. have an associated feature acquisition cost. In such domains, there is a cost budget, and getting all the features of an instance can be very costly.

tests for reasonably accurate prediction. We build on the intuition that, given certain observed features like one's demographic details, the most important features for a patient depend on the important features for similar patients. Based on this intuition, we find similar data points in the observed feature space and identify the important feature subsets of these similar instances by employing a greedy information-theoretic feature selector objective.

Our contributions in this work are as follows: (1) formalize the problem as a joint optimization problem of selecting the best feature subset for similar data points and optimizing the loss function using the important feature subsets; (2) account for acquisition cost in both the feature selector objective and the classifier objective to balance the trade-off between acquisition cost and model performance; (3) empirically demonstrate the effectiveness of the proposed approach on three real-world medical data sets.

2 RELATED WORK

The related work on cost-sensitive feature selection and learning can be categorized into the following four broad approaches.

Tree based budgeted learning: Prediction-time elicitation of features under a cost budget has been widely studied in the literature. A lot of work has been done in tree based models [5, 16, 17, 26–28] by adding a cost term to the tree objective function in either decision trees or ensemble methods like gradient boosted trees. All these methods aim to build an adaptive and complex decision tree boundary by considering the trade-off between performance and test-time feature acquisition cost. While we are similar in motivation to these approaches, our methodology is different in the sense that we do not consider tree based models. Instead our approach aims
As a result, to find local feature subsets using an information theoretic feature many cost sensitive classifier models [2, 8, 24] have been proposed selector for different clusters of training instance build in a lower in literature to incorporate the cost of acquisition into the model dimensional space. objective during training and prediction. Adaptive classification and dynamic feature discovery: Our Our problem is motivated by such a cost-aware setting where the work also draws inspiration from Nan al.’s work [15] where they assumption is that prediction time features have an acquisition cost learn a high performance costly model and approximate the model’s and adheres to a strict budget. Consider a patient visiting a doctor performance adaptively by building a low cost model and gating for some potential diagnosis of a disease. For such a patient, infor- function which decides which model to use for specific training in- mation like age, gender, ethnicity and other demographic features stances. This adaptive switching between low and high cost model are easily available at zero cost. However, various lab tests that the takes care of the trade-off between cost and performance. Our patient needs to undergo incurs cost. So, a training model should be method is different from theirs because we do not maintain a high able to identify the most relevant (i.e. those which are most infor- cost model which is costly to build and and difficult to decide. We mative, yet least costly) lab tests that are required for each specific refine the parameters of a single low cost model by incorporating a patient. The intuition of this work is that different patients, depend- cost penalty in the feature selector and model objective. Our work ing on their history, ethnicity, age and gender, may require different is also along the direction of Nan et al.’s work [18] where they select varying feature subsets for test instance using neighbourhood in- In M. Gaur, A. Jaimes, F. 
Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, formation of the training data. While calculating the neighborhood California, USA, August 24, 2020. Use permitted under Creative Commons License information from training data is similar to building clusters in Attribution 4.0 International (CC BY 4.0). our approach, the training neighborhood for our method is on just KiML’20, August 24, 2020, San Diego, California, USA, © 2020 Copyright held by the author(s). the observed feature space. Moreover, we incorporate the neigh- https://doi.org/10.1145/nnnnnnn.nnnnnnn bourhood information in the training algorithm whereas Nan et KiML’20, August 24, 2020, San Diego, California, USA, Srijita Das, Rishabh Iyer, and Sriraam Natarajan al.’s work is a prediction time algorithm. Ma et al. [10] also address observed features to find similar instances in the training set and this problem of dynamic discovery of features based on generative identify the important feature subsets for each of these clusters modelling and Bayesian experimental design. based on a feature selector objective function which balances the Feature elicitation using Reinforcement learning: There is trade-off between choosing the important features and the cost at another line of work along the sequential decision making liter- which these features are acquired. ature [4, 9, 22] to model the test time elicitation of features by learning the optimal policy of test feature acquisition. Along this direction, our work aligns with the work of Shim et al. [25] where 3.2 Proposed solution they jointly train a classifier and RL agent together. Their classifier As a first step, we cluster the training instances based on just the objective function is similar to our method with a cost penalty, observed zero cost feature set O. The intuition is that instances however they use a Deep RL agent to figure out the policy. 
We on with similar features will also have similar characteristics in terms the other hand use localised feature selector to find the important of which elicitable features to order. For example, in a medical appli- feature subsets for the underlying training clusters in the observed cation, whether to request for a blood test or a ct-scan will depend feature space. on factors such as age, gender, ethnicity and whether patients with Active Feature Acquisition: Our problem set-up is also inspired similar demographic features had requested these tests. Also, since by work along active feature acquisition [13, 14, 19, 23, 29] where the feature set O, comes at zero cost, we assume that for unseen certain feature subsets are observed and rest are acquired at a cost. test instances, this feature set is observed. While all the above mentioned work follow this problem set up during training time and typically use active learning to seek infor- mative instances at every iteration, we use this particular setting for test instances. Unlike their work, all the training instances in our work are fully observed and the assumption is that the feature acquisition cost has already being paid during training. Also, we address a supervised classification problem instead of an active learning set up. Our problem set up is similar to Kanani et al. [6] as they also have partial test instances, however their problem is that of instance acquisition where the acquired feature subset is fixed. Figure 1: Optimization framework for the proposed problem Our method aims at discovering variable length feature subsets for various underlying clusters. Our contributions: Although the problem of prediction time fea- ture elicitation has been explored in literature from various direc- tions and with various assumptions, we come up with an intu- itive solution to this problem and formulate the problem in a two We propose a model which consists of a parameterized feature step optimization framework. 
We incorporate acquisition cost selector module 𝐹 (𝑋, E𝑐𝑖 , 𝛼) which takes in a set of input instances in both the feature selector and model objectives to balance the 𝐸𝑐𝑖 belonging to the cluster 𝑐𝑖 based on the feature set O and pro- performance and cost trade-off. The problem set up is naturally duces a subset 𝑋 of most important features for the classification applicable in real world health care and other domains where the task. The feature selection model is based on an information- theo- knowledge of the observed features also needs to be accounted retic objective function and is augmented with the feature cost to while selecting the elicitable features . account for the trade off between model performance and acquisi- tion cost at test-time. The output feature subset from the feature 3 COST AWARE FEATURE ELICITATION selector module are used to update the parameters of the classifier. The optimization framework is shown in Figure 1 3.1 Problem setup Information theoretic Feature selector model: The feature Given: A dataset {(𝑥 1, 𝑦1 ), · · · , (𝑥𝑛 , 𝑦𝑛 )} with each 𝑥𝑖 ∈ R𝑑 as the selector module selects the best subset of features for each cluster feature set. Each feature has an associated cost 𝑟𝑖 . of training data based on an information theoretic objective score. Objective: Learn a discriminative model which is aware of the fea- Since at test time, we do not know the elicitable feature subset E ture costs and can balance the trade-off between feature acquisition (since the goal of feature selection is in the first place to find the cost and model performance. truly necessary features for learning). Hence we propose to use the We make an additional assumption here that there is a subset of fea- closest set of instances in the training data to the current instance. tures which have 0 cost. These could be, for example, demographic Since we assume that the training data has already been elicited, information (e.g. 
age, gender, etc) in a medical domain which are we have all the features observed in the training data. We compute easily available/less cumbersome to obtain as compared to other this distance just based on the observed feature set O. We cluster features. In other words, we can partition the feature set F = O ∪ E the training data based on the observed features into m clusters where O are the zero cost observed features and E are the elicitable 𝑐 1, 𝑐 2, · · · 𝑐𝑚 . Next, we use the Minimum-Redundancy-Maximum features which can be acquired at a cost. We also assume that the Relevance (MRMR) feature Selection paradigm [1, 21]. We denote training data is completely available with all features (i.e. the cost parameters [𝛼𝑐1𝑖 , 𝛼𝑐2𝑖 , 𝛼𝑐3𝑖 , 𝛼𝑐4𝑖 ] as parameters of a particular cluster for all the features has already been paid). The goal is to use these 𝑐𝑖 . The feature selection module is a function of the parameters of KiML’20, August 24, 2020, San Diego, California, USA, Cost Aware Feature Elicitation the cluster to which a set of instances belong and is defined as: where 𝜆1 and 𝜆2 are hyper-parameters. In the above equation, 𝜃 Õ is the parameter of the model and can be updated by standard 𝐹 (𝑋, E𝑐𝑖 , 𝛼𝑐𝑖 ) = 𝛼𝑐1𝑖 𝐼 (E𝑝 ; 𝑌 ) gradient based techniques. This loss function takes into account the E𝑝 ∈𝑋 important feature subset for each cluster and updates the parameter accordingly. The classifier objective also consists of a cost term | {z } max. relevance denoted by 𝑐 (𝑋𝛼𝑖 ) to account for the cost of the selected feature subset. For hard budget on the elicited features, the cost component Õ © Õ Õ − 𝛼𝑐2𝑖 𝐼 (E 𝑗 ; E𝑝 ) − 𝛼𝑐3𝑖 𝐼 (E𝑝 ; E 𝑗 |𝑌 ) ® ª E𝑝 ∈𝑋 « E 𝑗 ∈𝑋 E 𝑗 ∈𝑋 (1) in the model objective can be considered. In case of a cost budget, this component can be ignored because the elicited feature subset ¬ | {z } Õ min. redundancy adheres to a fixed cost and hence, this term is constant. 
− 𝛼𝑐4𝑖 𝑐 (E𝑝 ) E𝑝 ∈𝑋 3.3 Algorithm | {z } We present the algorithm for Cost Aware Feature Elicitation cost penalty (CAFE) in Algorithm 1. CAFE takes as input set of training examples where 𝐼 (E𝑝 ; 𝑌 ) is the mutual information between the random vari- E, the zero cost feature set O, the elicitable feature subset E, a cost able E𝑝 (feature) and 𝑌 (target). In the above equation, the feature vector 𝑀 ∈ R𝑑 and a budget 𝐵. Each element in the training set E subset 𝑋 is grown greedily using a greedy optimization strategy consists of a tuple (𝑥, 𝑦) where 𝑥 ∈ R𝑑 is the feature vector and y maximizing the above objective function. In equation 1, E𝑝 denotes is the label. a single feature from the elicitable set E that is considered for eval- The training instances E are clustered based on just the observed uation based on the subset 𝑋 grown so far. The first term is the feature set O using K-means clustering (Cluster). For every cluster mutual information between each feature and the class variable 𝑌 . 𝑐𝑖 , the training instances belonging to the cluster is assigned to In a discriminative task, this value should be maximized. The sec- the set E𝑐𝑖 and is passed to the Feature Selector module (lines 6-8). ond term is the pairwise mutual information between each feature The FeatureSelector function takes E𝑐𝑖 , parameter 𝛼, the feature to be evaluated and the features already added to the feature subset subsets O and E, cost vector 𝑀 and a predefined budget 𝐵 as input 𝑋 . This value needs to be minimized for selecting informative fea- and returns the most important feature subset X𝑐𝛼𝑖 corresponding tures. The third term is called the conditional redundancy [1] and to a cluster 𝑐𝑖 . A greedy optimization technique is used to grow this term needs to be maximized. The last term adds the penalty the feature subset 𝑋 of every cluster based on the feature selector for cost of every feature and ensures the right trade-off between objective function defined in Equation 1. 
The FeatureSelector cost, relevance and redundancy. For this work, we do not learn the terminates once the budget 𝐵 is exhausted or the mutual informa- parameters 𝛼𝑐𝑖 for each cluster, instead fix these parameters to 1. tion score becomes negative. Once all the important feature subsets We leave the learning of these parameters to future work. are obtained for all the |𝐶 | clusters, the model objective function is In the problem setup, since the 0 cost feature subset is always optimized as mentioned in Equation 3 for all the training instances present, we always consider the observed feature subset O in ad- using the important feature subsets for the clusters to which the dition to the most important feature subset as returned by the training instances belong (lines 12-18). All the remaining features Feature selector objective. We also account for the knowledge of are imputed by using either 0 or any other imputation model be- the observed features while growing the informative feature subset fore training the model. The final training model G(E O∪𝑋𝛼 , 𝛼, 𝜃 ) through greedy optimization. Specifically, while calculating the is an unified model used to make predictions for a test-instance pairwise mutual information between the features and the condi- consisting of just the observed feature subset O. tional redundancy term (second and third term of equation 1), we also evaluate the mutual information of the features with these 4 EMPIRICAL EVALUATION observed features. It is to be noted that in cases where the observed We did experiments with 3 real world medical data sets. The in- features are not discriminative enough of the target, the feature se- tuition of CAFE makes more sense in medical domains, hence our lector module ensures that the elicitable features with maximum choice of data sets. However, the idea can be applied to other do- relevance to the target variable are picked. mains ranging from logistics to resource allocation task. 
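To make the selector concrete, the greedy growth of the feature subset X under the objective in Equation 1 can be sketched as follows. This is a minimal sketch with all α parameters fixed to 1, as in the paper; the discrete mutual-information helpers and the dictionary-based data layout are illustrative assumptions, not the authors' implementation (which builds on the feature selection package of Li et al. [7]):

```python
import math
from collections import Counter

def mutual_info(a, b):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def cond_mutual_info(a, b, y):
    """Conditional mutual information I(a; b | y) for discrete sequences."""
    n = len(y)
    total = 0.0
    for val, cnt in Counter(y).items():
        idx = [i for i in range(n) if y[i] == val]
        if len(idx) > 1:
            total += (cnt / n) * mutual_info([a[i] for i in idx],
                                             [b[i] for i in idx])
    return total

def cafe_select(columns, target, cost, budget):
    """Greedily grow a feature subset X maximizing Equation 1 (all alphas = 1).

    columns: dict feature-name -> list of discrete values (one cluster's rows)
    cost:    dict feature-name -> acquisition cost
    Stops when the budget is exhausted or no candidate has a positive score.
    """
    X, spent = [], 0.0
    while True:
        best, best_score = None, 0.0
        for f in columns:
            if f in X or spent + cost[f] > budget:
                continue
            score = mutual_info(columns[f], target)       # max. relevance
            score -= sum(mutual_info(columns[g], columns[f])
                         for g in X)                      # min. redundancy
            score += sum(cond_mutual_info(columns[f], columns[g], target)
                         for g in X)                      # conditional redundancy
            score -= cost[f]                              # cost penalty
            if score > best_score:
                best, best_score = f, score
        if best is None:
            return X
        X.append(best)
        spent += cost[best]
```

Each candidate's marginal gain mirrors the four terms of Equation 1: relevance to the target, minus pairwise redundancy with the already chosen features, plus conditional redundancy, minus the feature's cost.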
Optimization problem: The cost-aware feature selector F(X, E_ci, α), for a given set of instances E_ci belonging to a specific cluster c_i, solves the following optimization problem:

    X_α^i = argmax_{X ⊆ E} F(X, E_ci, α)    (2)

For a given instance (x, y), we denote L(x, y, X, θ) as the loss function using a subset X of the features as obtained from the feature selector optimization problem. The optimization problem for learning the parameters of a classifier can be posed as:

    min_θ Σ_{i=1}^{n} L(x_i, y_i, X_α^i, θ) + λ1 · c(X_α^i) + λ2 · ||θ||^2    (3)

Algorithm 1 Cost Aware Feature Elicitation
1:  function CAFE(E, O, E, M, B)
2:      E = E_{O∪E}    ⊲ E consists of 0-cost features O and costly features E
3:      C = Cluster(E_O)    ⊲ clustering based on the observed features O
4:      X = {∅}    ⊲ stores the best feature subset of each cluster
5:      for i = 1 to |C| do    ⊲ repeat for every cluster
6:          E_ci = GetClusterMember(E, C, i)
7:              ⊲ get the data points belonging to each cluster c_i
8:          X_ci^α = FeatureSelector(E_ci, α, O, E, M, B)
9:              ⊲ parameterized feature selector for each cluster
10:         X = X ∪ {X_ci^α ∪ O}
11:     end for
12:     for i = 1 to |C| do    ⊲ repeat for every cluster
13:         X_ci^α = GetFeatureSubset(X, i)
14:             ⊲ get the feature subset for each cluster c_i
15:         for j = 1 to |E_ci| do    ⊲ repeat for every data point in cluster c_i
16:             Optimize J(x_j, y_j, X_ci^α, θ, M)
17:                 ⊲ optimize the objective function in Equation 3
18:             Update θ    ⊲ update the model parameter θ
19:         end for
20:     end for
        return G(E_{O∪X_α}, α, θ)    ⊲ G is the training model built on E
21: end function

Table 2 jots down the various features of the data sets used in our experiments. Below are the details of the 3 real data sets we use for our experiments.

1. Parkinson's disease prediction: The Parkinson's Progression Marker Initiative (PPMI) [12] is an observational study whose aim is to identify Parkinson's disease progression from various types of features. The PPMI data set consists of features related to various motor functions and non-motor behavioral and psychological tests. We consider certain motor assessment features like rising from chair, gait, freezing of gait, posture and postural stability as observed features, and all the remaining features as elicitable features which must be acquired at a cost.

2. Alzheimer's disease prediction: The Alzheimer's Disease Neuroimaging Initiative (ADNI¹) is a study that aims to test whether various clinical, fMRI and biomarker features can be used to predict the early onset of Alzheimer's disease. In this data set, we consider the demographics of the patients as observed, zero-cost features, and the fMRI image data and cognitive score data as unobserved, elicitable features.

3. Rare disease prediction: This data set is created from survey questionnaires [11], and the task is to predict whether a person has a rare disease or not. The demographic features are observed, while other sensitive questions in the survey regarding technology use, health and disease-related meta information are considered elicitable.

Evaluation methodology: All the data sets were partitioned into an 80:20 train-test split. Hyper-parameters like the number of clusters on the observed features were picked by 5-fold cross validation on all the data sets. The optimal numbers of clusters picked were 6 for ADNI, 9 for the Rare disease data set and 7 for the PPMI data set. For the results reported in Table 1, we considered a hard budget on the number of elicitable features and set it to half of the total number of features in the respective data set. We use K-means clustering as the underlying clustering algorithm. For all the reported results, we use an underlying Support Vector Machine [3] classifier with a radial basis kernel function. Since all the data sets are highly imbalanced, we consider metrics like recall, F1, AUC-ROC and precision for our reported results. For the feature selector module, we used the existing implementation of Li et al. [7] and built upon it.

We consider two variants of CAFE: (1) CAFE, in which we replace the missing and unimportant features of every cluster with 0 and then update the classifier parameters; and (2) CAFE-I, where we replace the missing and unimportant features by using an imputation model learnt from the already acquired feature values of other data points. A simple imputation model is used, where we replace the missing features with the mode for categorical features and the mean for numeric features.

Baselines: We consider 3 baselines for evaluating CAFE and CAFE-I: (1) using the observed, zero-cost features to update the training model, denoted OBS; (2) using a random subset of a fixed number of elicitable features and all the observed features to update the training model, denoted RANDOM (for this baseline, the results are averaged over 10 runs); and (3) using the information-theoretic feature selector score defined in Equation 1 to select the 'k' best elicitable features on the entire data, without any cluster consideration, along with the observed features, denoted KBEST. We keep the value of 'k' the same as that used by CAFE. Although some of the existing methods could be potential baselines, none of them match the exact setting of our problem, hence we do not compare our method against them.

Figure 2: Recall vs. number of clusters for Rare disease for CAFE-I.

Results: We aim to answer the following questions:
Q1: How do CAFE and CAFE-I with a hard budget on features compare against the standard baselines?
Q2: How do the cost-sensitive versions of CAFE and CAFE-I fare against the cost-sensitive baseline KBEST?

The results reported in Table 1 suggest that both CAFE and CAFE-I significantly outperform the other baselines in almost all the metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and CAFE-I outperform the other baselines in the clinically relevant recall metric, while KBEST performs best for the other metrics. The reason is that in ADNI the elicitable features are image features, and we discretize the image features to calculate the information gain for the feature selector module; the granular-level feature information is lost because of this discretization, hence the drop in performance.
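The simple per-column imputation used by CAFE-I above (mode for categorical features, mean for numeric ones) can be sketched as follows; the None-based encoding of unacquired feature values is an illustrative assumption, not the authors' data format:

```python
from collections import Counter

def impute(rows, numeric):
    """Fill None entries column-wise: mean for numeric columns, mode otherwise.

    rows:    list of equal-length lists, with None marking unacquired features
    numeric: set of column indices treated as numeric
    A sketch of CAFE-I's imputation step; plain CAFE instead replaces None by 0.
    """
    cols = list(zip(*rows))
    filled = []
    for j, col in enumerate(cols):
        known = [v for v in col if v is not None]
        if j in numeric:
            fill = sum(known) / len(known)          # mean of observed values
        else:
            fill = Counter(known).most_common(1)[0][0]   # mode of observed values
        filled.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]
```

For example, `impute([[1, 'a'], [3, None], [None, 'a']], {0})` fills the missing numeric entry with the column mean 2.0 and the missing categorical entry with the column mode 'a'.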
For the experiments in Table 1, 1 www.loni.ucla.edu/ADNI we keep the budget to be approximately half of the total number KiML’20, August 24, 2020, San Diego, California, USA, Cost Aware Feature Elicitation Data set Algorithm Recall F1 AUC-ROC AUC-PR OBS 0.647 0.488 0.642 0.347 RANDOM 0.57 ± 0.064 0.549± 0.059 0.693 ± 0.042 0.421 ± 0.051 Rare disease KBEST 0.47 0.457 0.628 0.349 CAFE 0.647 0.628 0.749 0.489 CAFE-I 0.647 0.647 0.759 0.512 OBS 0.765 0.685 0.741 0.563 RANDOM 0.857 ± 0.023 0.809 ± 0.015 0.85 ± 0.013 0.712 ± 0.020 PPMI KBEST 0.828 0.807 0.846 0.716 CAFE 0.846 0.817 0.855 0.726 CAFE-I 0.855 0.829 0.865 0.743 OBS 0.5 0.44 0.553 0.365 RANDOM 0.711 ± 0,043 0.697 ± 0.082 0.767 ± 0.064 0.592 ± 0.098 ADNI KBEST 0.73 0.745 0.806 0.646 CAFE 0.807 0.711 0.786 0.578 CAFE-I 0.769 0.701 0.776 0.574 Table 1: Comparison of CAFE against other baseline methods on 3 real data sets Dataset # Pos # Neg # Observed # Elicitable 5 CONCLUSION PPMI 554 919 5 31 ADNI 94 287 6 69 In this paper, we pose the prediction time feature elicitation problem Rare Disease 87 232 6 63 as an optimization problem by employing a cluster specific feature Table 2: Data set details of the 3 real data sets used.#Pos is num- selector to choose the best feature subset and then optimizing the ber of positive example, #Neg is number of negative example. # Ob- training loss. We show the effectiveness of our approach in real data served is number of observed features and # Elicitable is the maxi- sets where the problem set up is intuitive. Future work includes mum number of features that can be acquired. learning the parameters of the feature selector module and jointly optimizing the feature selector and model parameters for a more robust framework and adding more constraints to optimization. of features for all the methods. 
On an average, CAFE-I performs ACKNOWLEDGEMENTS better than CAFE across all the data sets because of the underlying SN & SD gratefully acknowledge the support of NSF grant IIS- imputation model which helps in better treatment of the missing 1836565. Any opinions, findings and conclusion or recommenda- values as against replacing all the features by 0. This answers Q1 tions are those of the authors and do not necessarily reflect the affirmatively. view of the US government. In Figure 3, we compare the cost version of CAFE and CAFE-I against KBEST. Cost version takes into account the cost of individ- ual features and accounts for them as penalty in the feature selector REFERENCES [1] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. 2012. Conditional module. Hence, in this version of CAFE, a cost budget is used as likelihood maximisation: a unifying framework for information theoretic feature opposed to hard budget on the number of elicitable features. We gen- selection. JMLR (2012). erate the cost vector by sampling each cost component uniformly [2] Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X Ling. 2004. Test-cost sensitive naive bayes classification. In ICDM. from (0,1). For PPMI and Rare disease, we can see that cost sensitive [3] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine CAFE performs consistently better than KBEST with increasing learning (1995). cost budget. In the PPMI data set, the greedy optimization of the [4] Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. 2011. Datum-wise classification: a sequential approach to sparsity. In ECML feature selector objective on the entire data set lead to elicitation of PKDD. 375–390. just 1 feature, beyond that the information gain was negative, hence [5] Tianshi Gao and Daphne Koller. 2011. Active classification based on value of classifier. In NIPS. the performance of PPMI across various cross budget remains the [6] P. Kanani and P. Melville. 
2008. Prediction-time active feature-value acquisition same. CAFE on the other hand was able to select important feature for cost-effective customer targeting. Workshop on Cost Sensitive Learning at subsets for various clusters based on the observed features related NIPS (2008). [7] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, to gait and postures. For ADNI data set, CAFE performs better than Jiliang Tang, and Huan Liu. 2018. Feature selection: A data perspective. ACM KBEST only in recall. The reason for this is the same as mentioned Computing Surveys (CSUR) (2018). above. This helps in answering Q2 affirmatively. [8] Charles X Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. 2004. Decision trees with minimal costs. In ICML. Lastly, Figure 2 shows the effect of increasing cluster on the [9] D. J. Lizotte, O. Madani, and R. Greiner. 2003. Budgeted learning of Naive-Bayes validation recall for the Rare disease data set. As can be seen, for classifiers (UAI). 378–385. [10] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez- smaller number of clusters, the recall is very low and increases to Lobato, Sebastian Nowozin, and Cheng Zhang. 2019. EDDI: Efficient Dynamic an optimum for 9 clusters. This helps us in understanding the fact Discovery of High-Value Information with Partial VAE. In ICML. that forming clusters based on observed important features helps [11] H. MacLeod, S. Yang, et al. 2016. Identifying rare diseases from behavioural data: a machine learning approach (CHASE). 130–139. CAFE in selecting different feature subsets for different clusters, [12] K. Marek, D. Jennings, et al. 2011. The Parkinson Progression Marker Initiative thus helping the learning procedure. (PPMI). Prog Neurobiol 95, 4 (2011), 629–635. 
KiML’20, August 24, 2020, San Diego, California, USA, Srijita Das, Rishabh Iyer, and Sriraam Natarajan Figure 3: Recall (left), F1 (middle), AUC-PR (right) for (from top to bottom) Rare Disaese, PPMI, and ADNI. The x-axis refers to the cost budget used which leads to the elicitation of different number of features. [13] P. Melville, M. Saar-Tsechansky, et al. 2004. Active feature-value acquisition for (2005), 1226–1238. classifier induction (ICDM). 483–486. [22] Thomas Rückstieß, Christian Osendorfer, and Patrick van der Smagt. 2011. Se- [14] P. Melville, M. Saar-Tsechansky, et al. 2005. An expected utility approach to quential feature selection for classification. In Australasian Joint Conference on active feature-value acquisition (ICDM). 745–748. Artificial Intelligence. Springer, 132–141. [15] Feng Nan and Venkatesh Saligrama. 2017. Adaptive classification for prediction [23] M. Saar-Tsechansky, P. Melville, and F. Provost. 2009. Active feature-value under a budget. In NIPS. acquisition. Manag Sci 55, 4 (2009). [16] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2015. Feature-budgeted [24] Victor S Sheng and Charles X Ling. 2006. Feature value acquisition in testing: a random forest. In ICML. sequential batch test algorithm. In ICML. [17] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2016. Pruning random forests [25] Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018. Joint active feature acquisi- for prediction on a budget. In NIPS. tion and classification with variable-size set encoding. In NIPS. [18] Feng Nan, Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2014. Fast [26] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2015. Efficient learn- margin-based cost-sensitive classification. In ICASSP. ing by directed acyclic graph for resource constrained prediction. In NIPS. [19] Sriraam Natarajan, Srijita Das, Nandini Ramanan, Gautam Kunapuli, and Predrag [27] Zhixiang Xu, Matt Kusner, Kilian Weinberger, and Minmin Chen. 2013. 
A New Delay Differential Equation Model for COVID-19: Retarded Logistic Equation

B Shayak† (Mechanical and Aerospace Engg, Cornell University, Ithaca, New York State, USA; sb2344@cornell.edu)
Mohit M Sharma (Population and Health Sciences, Weill Cornell Medicine, New York City, USA; mos4004@med.cornell.edu)
Manas Gaur (AI Institute, University of South Carolina, USA; mgaur@email.sc.edu)

†Presenting author, corresponding author. ORCID: 0000-0003-2502-2268

ABSTRACT

In this work we give a delay differential equation, the retarded logistic equation, as a mathematical model for the global transmission of COVID-19. This model accounts for asymptomatic carriers, pre-symptomatic or latent transmission, as well as contact tracing and quarantine of suspected cases. We find that the equation admits varied classes of solutions including self-burnout, progression to herd immunity, and multiple states in between. We use the term "partial herd immunity" to refer to these states, where the disease ends at an infection fraction which is not negligible but is significantly lower than the conventional herd immunity threshold. We believe that the spread of COVID-19 in every localized area can be explained by one of our solution classes.

CCS CONCEPTS
• Applied computing – mathematics and statistics

KEYWORDS
Retarded logistic equation, Asymptomatic carriers, Latent transmission, Contact tracing, Reproduction number calculation, Partial herd immunity

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). KiML'20, San Diego, California, USA. © 2020, Copyright held by the author(s).

1 Introduction

Three kinds of models to study COVID-19 are currently in vogue: lumped parameter or compartmental models (ordinary differential equations), agent-based models, and stochastic differential equation models. The first option affords maximum conceptual clarity at the expense of some simplifying assumptions (homogeneous mixing etc.). The second option affords maximum potential versatility at the cost of huge computational complexity and variability in the network structure. The third option combines features of the previous two; whether the features being synergized are the positive or the negative ones depends to a large extent on the modeler.

In this work we use delay differential equations (DDEs) to propose a simple, single-variable, lumped parameter model for the spread of Coronavirus. Jahedi and Yorke [1] make a strong case for simpler models relative to complex and elaborate ones. In the literature, DDEs have been used for modeling COVID-19, for example in Refs. [2]–[4]. These authors however ignore features such as contact tracing, asymptomatic carriers, and latent transmission; our results too have a richer structure.

2 Derivation of the model

We measure time t in days and use as our basic variable y(t), the cumulative number of corona cases in the region of interest, including active cases, recovered cases, and deaths. The following "word-equation" summarizes the approach:

Rate of emergence of new cases = (Interaction rate of each existing case) × (Probability of transmission) × (Number of existing cases) .   (0)

The left hand side (LHS) here is just dy/dt, whereas the right hand side (RHS) needs a detailed derivation.

Equation (0) assumes that the disease is transmitted from infected to susceptible people via interaction, and not via airborne transmission. Due to asymptomatic and pre-symptomatic carriers, there are always cases moving about in society who are oblivious to their infectivity. Each such case interacts with other people at a different rate. For example, a working-from-home professor might venture outside once every three days and interact with one person on each trip, while a grocer might go to work and interact with 10 customers every day. The professor has an interaction rate of 1/3 persons/day while the grocer has an interaction rate of 10 persons/day. For a compartmental model, one must average over the professor, the grocer, and all the other un-quarantined cases to generate an effective per-case interaction rate q0.

Every interaction of course does not result in a transmission; there is a probability strictly less than unity that the virus jumps from the infected person to the person with whom s/he is interacting. This probability has two components. The first component is that the healthy person must be susceptible to begin with. While we ignore intrinsic insusceptibles, there will be people who have recovered from the disease and are therefore not susceptible again. In this Article, we assume that one bout of infection brings permanent immunity. The assumption is valid so long as the immunity period exceeds the total epidemic duration. Till date, there is little credible evidence for re-infection [5]–[7]; contrarily, a very recent and thorough study [8] based on the monitoring of a huge patient cohort has found significant evidence of long-lasting antibodies. If N is the initial number of susceptible people (recall that y is the case count), then the probability that a random person is a recovered case is approximately y/N and the probability that s/he is susceptible is approximately 1−y/N. This expression is approximate because the true number of recovered cases at any time is less than y; the error however is small since the recovery period is much shorter than the overall course of the epidemic. Note that 1−y/N is a logistic term, and represents a herd immunity effect.

Given susceptibility, the next probability is that the virus actually does jump from the un-quarantined case to the susceptible person. This probability depends on the level of precaution, such as face covering or mask, handwashing and disinfection, being adopted by the case as well as by the susceptible person. For a compartmental model, the probability must be averaged over all the un-quarantined cases. If this average probability is P0, then q0(1−y/N)P0 gives the per-case spreading rate. Since q0 and P0 are both dependent on public health measures, and are both difficult to measure independently, we club the two together into a single parameter which we call m0.

So far we have accounted for the rate at which each case spreads the disease; now we have to count the number of cases out of quarantine. Let us start with an asymptomatic carrier, who remains in open society throughout. S/he typically transmits the disease for 7 days, which is called the infection period. Then, new healthy people can only be infected by those asymptomatic cases who have fallen sick within the last 7 days, and not by those who have fallen sick earlier. The number of such people is the number of asymptomatic sick people today minus the number of those 7 days earlier. Mathematically, let μ1 (between 0 and 1) denote the fraction of asymptomatic carriers and τ1 the asymptomatic infection period. Then, the number of asymptomatic transmitters today is μ1(y(t)−y(t−τ1)). Here we can see the emergence of the delay term.

The remaining fraction 1−μ1 of cases are symptomatic. Let τ2 be the latency period during which these cases remain transmissible prior to displaying symptoms. It is assumed that they isolate themselves thereafter. We also assume that the incubation period is equal to the latency period. Finally, the contact tracing drive conducted by the public health department is taken into account. We assume that this drive is instantaneous and proceeds in the forward direction, starting from freshly arriving symptomatic cases. The contact trace captures patients who were exposed to the new case τ2 days ago, as well as patients who were exposed immediately before the new case manifested symptoms. The average duration for which these secondary patients have remained at large is τ2/2, be they symptomatic or asymptomatic. The assumption of instantaneous contact tracing, which decreases the average time that contact-traced cases spend out of quarantine, opposes the error arising from the assumption of a zero non-transmissible incubation period, which increases the average time for which the contact-traced cases transmit before quarantine. These two effects are assumed here to cancel. Let μ3 (between 0 and 1) denote the fraction of all cases who escape from contact tracing drives; the complementary fraction 1−μ3 get caught. Thus, we have three classes of un-quarantined cases: (a) 1−μ3 are contact-traced cases who remain in society for a time τ2/2, (b) μ3(1−μ1) are untraced symptomatic cases who go into isolation only after time τ2, and (c) μ3μ1 are undetected asymptomatic cases who transmit for the entire infection period τ1.

Arguments similar to those of the previous paragraph yield the total number of un-quarantined cases as

n = (1−μ3)[y(t) − y(t−τ2/2)] + (1−μ1)μ3[y(t) − y(t−τ2)] + μ1μ3[y(t) − y(t−τ1)] .   (1)

The preceding arguments now yield the mathematical form of (0) as

dy/dt = m0 (1 − y/N) [ y(t) − (1−μ3) y(t−τ2/2) − (1−μ1)μ3 y(t−τ2) − μ1μ3 y(t−τ1) ] ,   (2)

which is the retarded logistic equation.

3 Solutions of the model

Due to the complexity of equation (2), an analytical solution using perturbation theory etc. has not been attempted in this case. Instead we have used numerical integration to obtain the solutions of (2). Before giving the solutions however, we present the calculation of the reproduction number R. To find R at any state of evolution of the disease, we first treat y in the logistic term as constant, and then carry out the steps described in Ref. [9]. This yields the expression

R = m0 (1 − y/N) [ (1 + μ3 − 2μ1μ3) τ2/2 + μ1μ3 τ1 ] .   (3)

The ease of calculating R relative to the ordinary differential equation based models [10] is noteworthy.

Solution classes of the logistic DDE (2) are now demonstrated. The numerical integration routine used is second order Runge-Kutta with a time step of 1/1000 day. As the testbed for the simulations, we consider a Notional City having N=300000, μ1=0.8 (the maximum value as per our knowledge [11]–[13]), τ1=7 days and τ2=3 days [14]. The initial condition needs to be a function having the length of the maximum delay involved in the problem, which is seven days; we take this function to be zero cases to start with and a constant increase of 100 cases/day for a week.
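The paper does not include code; the following is our minimal sketch of such an integration, assuming Heun's method as the second-order Runge-Kutta variant and linear prehistory as described above. The parameters m0 = 0.23 and μ3 = 1/2 are the hard-lockdown City A values quoted in the next section; all names and the 200-day horizon are our choices, not the authors'.

```python
# Sketch of the numerical scheme: Heun's (second-order Runge-Kutta) method
# with step h = 1/1000 day. All delays are exact multiples of h, so delayed
# values are read directly off the stored grid without interpolation.
# Parameters: Notional City testbed with City A's m0 and mu3 (assumption).

N, mu1, mu3, m0 = 300_000, 0.8, 0.5, 0.23
tau1, tau2 = 7.0, 3.0            # days
h = 1.0 / 1000                   # time step, days
d1 = round(tau2 / 2 / h)         # delay tau2/2 in grid steps
d2 = round(tau2 / h)             # delay tau2
d3 = round(tau1 / h)             # delay tau1

# Initial history: zero cases, then a constant increase of 100 cases/day
# for one week, i.e. y = 0 at t = -7 and y = 700 at t = 0.
hist = round(7.0 / h)
y = [100.0 * (i * h) for i in range(hist + 1)]    # t in [-7, 0]

def rhs(yy, i):
    """Right-hand side of the retarded logistic equation (2) at grid index i."""
    n = (yy[i] - (1 - mu3) * yy[i - d1]
         - (1 - mu1) * mu3 * yy[i - d2]
         - mu1 * mu3 * yy[i - d3])
    return m0 * (1 - yy[i] / N) * n

steps = round(200.0 / h)         # integrate for 200 days
for i in range(hist, hist + steps):
    k1 = rhs(y, i)
    y.append(y[i] + h * k1)      # Euler predictor
    k2 = rhs(y, i + 1)           # slope at the predicted point
    y[i + 1] = y[i] + h * (k1 + k2) / 2

print(y[-1])                     # cumulative cases after 200 days
```

With these sub-threshold parameters (R0 < 1, as the next section reports) the cumulative count settles at a small fraction of N, i.e. the self-burnout class of solutions.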
Notional City A has m0=0.23 and μ3=1/2, which describes a hard lockdown [15] accompanied by good contact tracing. R0 (i.e. (3) evaluated at y=0) is 0.886. The epidemic ends with a negligible fraction of infected people, as shown below. This and the next five plots are three-way: each plot shows y as a blue line, its derivative as a green line, and the weekly increments in cases, or epidemiological curve, as a grey bar chart. These last have been reduced by a factor of 7 to ensure clarity of presentation. We report the rates on the left hand side y-axis and the cumulative cases on the right hand side y-axis.

Figure 1: City A extinguishes the epidemic in time.

This is exactly what has happened in New Zealand – that il fortunatissimo per verita ["most fortunate one, in truth"] has indeed quashed the epidemic completely, with the final case count being a negligible fraction of its total (tiny and sparsely distributed) population.

The parameter values for Notional City B are the same as those for A except that μ3=0.75; a greater fraction of cases escape the contact tracing drive. R0 is 1.16, and R becomes 1 at y=40500 cases.

Figure 2: City B grows at first before reaching burnout. The symbol 'k' denotes thousand.

The outbreak enters the exponential regime right after being released. As y increases, R gradually reduces, so the growth slows down until it peaks when the case count is about 39,000 [compare with the value of 40,500 when R=1 as per (3)]. Thereafter, the disease progresses to extinction in time. The overall progression is very long, but one hopes that the relatively small size of the peak can prevent overstressing of medical care facilities and thus avoid unnecessary deaths. Delhi and Mumbai in India and Los Angeles in the USA are in all probability cities of this type, since the disease there spiraled out of control despite hard lockdowns being imposed at an early stage.

City B also enables us to explain partial herd immunity. Even though the initial conditions were unfavourable for containment of the epidemic, herd immunity started activating as the disease proliferated. A stable zone (R<1) was entered when only 13.5 percent of the total susceptible population was infected, and a similar percentage again got infected before the epidemic ended. Thus, herd immunity worked in synergy with non-pharmaceutical interventions to stop the epidemic at only a 26 percent infection level, which is significantly less than the conventional 70-90 percent threshold [16]. This is what we call partial herd immunity. Our findings are in agreement with, and act as an explanation for, what has been obtained by Britton et al. [17] and Peterson et al. [18].

We now consider Notional City C, which differs from City B in that m0=0.5; lockdown is replaced by a much more permissive state. R0 is above 2.5; 180,000 infections are required to bring it below unity. Need one mention that this is a public health disaster?

Figure 3: City C goes to herd immunity – total, not partial. The symbol 'k' denotes thousand and 'L' hundred thousand.

Notional City D combines features of B and C. This city begins with m0=0.5 like City C but reduces to m0=0.23 like City B when the case count reaches 40,000 (the R=1 threshold for B's parameters).

Figure 4: As the input, so the output – D's response combines features of B and C. The symbol 'k' denotes thousand and 'L' hundred thousand.

We can see a case count as well as a total duration intermediate between B and C; the epidemic is over in 70 days, but the peak rate of 12,920 cases/day is still very high and likely to load hospital facilities beyond their carrying capacity.

Cities E and F demonstrate the issues faced in reopening. In both these cities, the parameters and case trajectory are identical to those of City A for the first 80 days. Then, E and F reopen on the 80th day by increasing m0 from 0.23 to 0.5, and simultaneously decreasing μ3, i.e. deploying a more effective contact tracing program which had been built up during the lockdown. The post-reopening μ3's for E and F are 0.1 and 0.2 respectively.

Figure 5: City E, like City A, is a success story.

Figure 6: Unlike City E, F is a failure story. The symbol 'k' denotes thousand and 'L' hundred thousand.

The difference between Cities E and F is dramatic. Mathematically, R remained less than unity throughout in E; its value after reopening was 0.985. We can see that the case rate decreases monotonically all the time. In F, the post-reopening R became 1.22 and sent the trajectory haywire. In practice however, the incipient increase in the case rate after the 80th day acts as an advance warning of what has happened – the reopening steps should be reversed if it is at all possible to do so while satisfying economic and other external constraints.
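The R values quoted for the Notional Cities can be checked directly from equation (3). The following sketch is ours (function name and structure are our choices); the parameter values are those stated in the text.

```python
# Check of the reproduction number values quoted in Section 3, using
# equation (3):
#   R = m0 * (1 - y/N) * ((1 + mu3 - 2*mu1*mu3) * tau2/2 + mu1*mu3*tau1)
# with the Notional City testbed N = 300000, mu1 = 0.8, tau1 = 7, tau2 = 3.

N, mu1, tau1, tau2 = 300_000, 0.8, 7.0, 3.0

def R(m0, mu3, y=0.0):
    """Reproduction number from equation (3)."""
    return m0 * (1 - y / N) * ((1 + mu3 - 2 * mu1 * mu3) * tau2 / 2
                               + mu1 * mu3 * tau1)

R0_A = R(m0=0.23, mu3=0.5)    # City A: hard lockdown, good tracing -> ~0.886
R0_B = R(m0=0.23, mu3=0.75)   # City B: poorer contact tracing      -> ~1.16
R0_C = R(m0=0.5,  mu3=0.75)   # City C: permissive state            -> above 2.5
R_E  = R(m0=0.5,  mu3=0.1)    # City E after reopening              -> ~0.985
R_F  = R(m0=0.5,  mu3=0.2)    # City F after reopening              -> ~1.22

# Case count at which City B reaches R = 1 (~40,500, i.e. ~13.5% of N):
y_star = N * (1 - 1 / R0_B)
print(R0_A, R0_B, R0_C, R_E, R_F, y_star)
```

These few lines reproduce every R value quoted in the text, including the post-reopening values for Cities E and F and the R=1 threshold for City B.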
Conclusion

In this Article we have presented a new mathematical model for COVID-19 which is simple and elegant in structure but can generate a variety of realistic solution classes. We hope that our work may be of use to mathematicians and data scientists who are trying to understand the spread of the disease in a quantitative manner. The public health implications of these results are reserved for another study.

REFERENCES
[1] S. Jahedi and J. A. Yorke, "When the best pandemic models are the simplest," medRxiv, pp. 1–22, 2020, doi: 10.1101/2020.06.23.20132522.
[2] L. Dell'Anna, "Solvable delay model for epidemic spreading: the case of Covid-19 in Italy," 2020. Available: http://arxiv.org/abs/2003.13571.
[3] A. K. Gupta, N. Sharma, and A. K. Verma, "Spatial Network based model forecasting transmission and control of COVID-19," medRxiv, p. 2020.05.06.20092858, 2020, doi: 10.1101/2020.05.06.20092858.
[4] J. Mendenez, "Elementary time-delay dynamics of COVID-19 disease," medRxiv, pp. 1–4, 2020, doi: 10.1101/2020.03.27.20045328.
[5] D. C. Ackerly, "Getting COVID-19 twice," Vox. Available: https://www.vox.com/2020/7/12/21321653/getting-covid-19-twice-reinfection-antibody-herd-immunity.
[6] S. McCamon, "13 USS Roosevelt Sailors Test Positive For COVID-19, Again."
[7] Y. Saplakoglu, "Coronavirus reinfections were false positives," Live Science. Available: https://www.livescience.com/coronavirus-reinfections-were-false-positives.html.
[8] A. Wajnberg et al., "SARS-CoV-2 infection induces robust, neutralizing antibody responses that are stable for at least three months," medRxiv, 2020, doi: 10.1101/2020.07.14.20151126.
[9] B. Shayak and R. H. Rand, "Self-burnout – A New Path to the End of COVID-19," medRxiv, pp. 1–14, 2020, doi: 10.1101/2020.04.17.20069443.
[10] O. Diekmann, J. A. P. Heesterbeek, and M. G. Roberts, "The construction of next-generation matrices for compartmental epidemic models," J. R. Soc. Interface, vol. 7, no. 47, pp. 873–885, 2010, doi: 10.1098/rsif.2009.0386.
[11] "71 percent of patients in Maharashtra are asymptomatic," Mumbai Mirror. Available: https://mumbaimirror.indiatimes.com/coronavirus/news/covid-19-71-of-patients-in-maharashtra-are-asymptomatic-mumbai-cases-at-16579/articleshow/75754328.cms.
[12] "Taking over hospital beds, conducting survey," New Indian Express. Available: https://www.newindianexpress.com/nation/2020/may/30/taking-over-hospital-beds-conducting-survey-uddhav-government-goes-after-covid-19-as-state-tally-c-2149989.html.
[13] "Delhi CM says COVID-19 deaths very less," Times of India. Available: https://timesofindia.indiatimes.com/city/delhi/delhi-cm-says-covid-19-deaths-very-less-but-75pc-cases-asymptomatic-or-showing-mild-symptoms/articleshow/75658636.cms.
[14] M. L. Childs et al., "The impact of long-term non-pharmaceutical interventions on COVID-19 epidemic dynamics and control," medRxiv, vol. 22, p. 2020.05.03.20089078, 2020, doi: 10.1101/2020.05.03.20089078.
[15] B. Shayak and M. M. Sharma, "Retarded Logistic Equation as a Universal Dynamic Model for the Spread of COVID-19," medRxiv, pp. 1–27, 2020, doi: 10.1101/2020.06.09.20126573.
[16] G. A. D'Souza and D. Dowdy, "What is herd immunity and how can we achieve it with COVID-19?" Available: https://www.jhsph.edu/covid-19/articles/achieving-herd-immunity-with-covid19.html.
[17] T. Britton, F. Ball, and P. Trapman, "The disease-induced herd immunity level for Covid-19 is substantially lower than the classical herd immunity level," pp. 1–15, 2020. Available: http://arxiv.org/abs/2005.03085.
[18] A. A. Peterson, C. F. Goldsmith, C. Rose, A. J. Medford, and T. Vegge, "Should the rate term in the basic epidemiology models be second-order?," 2020. Available: http://arxiv.org/abs/2005.04704.

Public Health Implications of a Delay Differential Equation Model for COVID-19

Mohit M Sharma (Population and Health Sciences, Weill Cornell Medicine, New York City, USA; mos4004@med.cornell.edu)
B Shayak (Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, New York State, USA; sb2344@cornell.edu)

ABSTRACT

This paper describes the strategies derived from a novel delay differential equation model [1], signifying a practical extension of our recent work. COVID-19 is an extremely ferocious and unpredictable pandemic which poses unique challenges for public health authorities, on account of which "case races" among various countries and states do not serve any purpose and present delusive appearances while ignoring significant determinants. We aim to propose comprehensive planning guidelines as a direct implication of our model. Our first consideration is reopening, followed by effective contact tracing and ensuring public compliance. We then discuss the implications of the mathematical results on people's behavior, and eventually provide conclusive points aimed at strengthening the arsenal of resources that are helpful in framing public health policies. Knowledge about the pandemic and its association with public health interventions is documented in various literature-based sources. In this study, we explore those resources to explain the findings inferred from our delay differential equation model of COVID-19.

KEYWORDS
Delay differential equation, Contact tracing, Socio-behavioral theories, Lockdown, Reopening

1 INTRODUCTION

The national (USA) and global spread of Coronavirus Disease 2019 (COVID-19), following its origins in Wuhan, China in at least December 2019 and possibly earlier still [2], has been alarmingly rapid and deadly. From the 25 individual national forecasts received by the CDC, it is predicted that the total reported COVID-19 deaths will be between 160,000 and 175,000 by August 15th, 2020 [3]. Some features however, both nationally and globally, have proved counterintuitive. For example, a 76-day lockdown resulted in the outbreak's containment in Wuhan. A similar measure has produced similar results in New Zealand. However, lockdown appeared only marginally effective in New York State, USA, where the case and death counts decreased only after reaching horrifying peak levels [4]. It was contended that the stay at home order in New York came too late. This apparent delay was not present in California, USA; the case counts there went up all the same, and the rate is high even today. We would like to mention that such spatiotemporal anomalies are present not just in the US but also in other countries such as Canada, Russia and India [5], which witnessed high case growth despite being in lockdown. In order to better understand the epidemiology of the transmission of COVID-19, we have constructed a delay differential equation model. Here we present its practical implications, which try to encapsulate a myriad of factors associated with the current scenario.

2 MATHEMATICAL MODELING TO UNDERSTAND THE EPIDEMIOLOGY

For many decades, mathematical modelling has been used as an integral tool for recognizing the trend of disease progression during pandemics. For example, using a simple model explaining the transmission dynamics of infectious disease among the susceptible, infected and recovered populations (the SIR epidemic model), Kermack and McKendrick proposed and later established a principle: the level of susceptibility in the population should be adequately high in order for an epidemic to unfold in that population. Such mathematical models can give valuable insights in explaining the epidemiological status of the population, and can predict or calculate the transmissibility of the pathogen and the potential impact of public health preventive practices [6]. However, a significant body of evidence suggests that decisions should be made regarding the parameters to be included, contingent on their impact on the precision of predictions. Several policy questions about the containment of this outbreak have been considered in our recently proposed simple non-linear model [1]. This paper delves into the practical solutions that can be devised utilizing the directions of our model's outcome.
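Kermack and McKendrick's threshold principle can be illustrated in a few lines. The sketch below is ours, not from either paper: the parameter values (β = 0.3/day, γ = 0.1/day, N = 10^6) are invented for illustration, and forward Euler is used purely for brevity.

```python
# Minimal SIR (Kermack-McKendrick) sketch. Illustrative only: beta, gamma
# and N are invented values, not taken from the papers above.

def simulate_sir(N=1_000_000, I0=10, beta=0.3, gamma=0.1, days=300, dt=0.01):
    """Forward-Euler integration of
       dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I.
       Returns the final (S, I, R)."""
    S, I, R = N - I0, float(I0), 0.0
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt   # new infections this step
        new_rec = gamma * I * dt          # new recoveries this step
        S -= new_inf
        I += new_inf - new_rec
        R += new_rec
    return S, I, R

# With R0 = beta/gamma = 3 > 1 the epidemic takes off and infects most of
# the population; with beta = 0.05 (R0 = 0.5 < 1) it cannot unfold at all.
S, I, R = simulate_sir()
S2, I2, R2 = simulate_sir(beta=0.05)
print(R, R2)
```

The contrast between the two runs is exactly the susceptibility-threshold principle described above: the same seed of 10 cases either sweeps the population or fizzles out, depending only on whether β/γ exceeds unity.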
To generate interpretable results from the epidemiological model, we have used the examples of six types of cities [1]:

1) City A – Moderately effective contact tracing in a hard lockdown. This city has R (reproduction number) < 1 and drives the epidemic to extinction in time.
2) City B – Less effective contact tracing in a hard lockdown. It starts off with R > 1, but reaches R = 1 at a 15% infection level. The epidemic ends at a 30% infection level and takes a very long time to get there.
3) City C – Less effective contact tracing (like City B) with milder restrictions on mobility. It proceeds rapidly to herd immunity.
4) City D – Combination of City B and City C. It starts with mild restrictions on mobility and then progresses to harder restrictions. The duration of the epidemic as well as the final case count are intermediate between those of Cities B and C.
5) City E – Starting off like City A, it reopens with very effective contact tracing and drives the epidemic to extinction in time.
6) City F – Starting off like City A, it reopens with less effective contact tracing and suffers a second wave.

Pragmatic implications of our work are as follows.

3 REOPENING CONSIDERATIONS, ROLE OF TESTING

The unemployment situation generated as a result of lockdowns is currently forcing countries and states to partially reopen their economies even though many of them have not yet got the virus under control. The reopening is easiest in City A regions where cases have slowed down to a trickle. With every new case being detected, swift isolation of all potential secondary, tertiary and maybe even quaternary cases, both forward and backward, should prove possible while the rest of the economy functions in a relatively uninhibited way. Even one mass transmission event can restart an exponential growth regime and force a rollback to a fully locked down state.

Reopening beyond a skeletal level is impossible in City B regions which are still in the ascending phase. The ascent implies that contact tracing is already inadequate, and if on top of that mobility increases, then the region might turn into City C, overstress healthcare systems, and become a massacre. An ascending B-City has little option other than to contact trace as hard as possible and wait for partial herd immunity to kick in. Only when that happens and the cases slow down on their own can it consider a more extensive reopening like a City A region.

Testing is no doubt an important part of the epidemic management process, since it enables the authorities to get an accurate description of the spread of the disease. As we have already discussed, limited testing capacity is giving us a partial or distorted picture in many regions. There is a widespread media perception that extensive testing is one of the prerequisites for any kind of reopening process [7], [8]. Much criticism has also been levelled at certain countries for having inadequate testing programs (we shall elaborate on the blame aspects later). However, we would like to emphasize that testing is as yet a diagnostic tool and not a preventive one. Currently, it can show us how the disease is behaving but cannot slow its spread in any way. Test-induced slowing can come only when the capacity expands to such a level as to be able to preventively test potential super-spreaders such as grocers and food workers every single day. We hope that such a development may prove possible in the near future – many universities, for example, are making reopening arrangements with provision for very frequent testing of the entire community.

During reopening it is vital to get a true picture of the disease evolution so that we can gauge the effect of any relaxation of restrictions – whether it keeps the outbreak under control as in City E or brings about the beginnings of a second wave as in City F. Such beginnings are heralded by a rise in the case rate. As we saw, there was no such rise in City E even though R increased after the reopening. If the rise takes place, the relaxation must immediately be rolled back to avert disaster. Hence, during reopening, the testing capacity must be high enough to detect such incipient rises. As per China's state media reports, with the aim of reopening the economy, the city of Wuhan conducted 6 million tests in one week; we present this fact without discussion or comment.

A second reason why testing is still not all that it could have been is the high false-negative rate during the initial stages of infection [9]. Suppose a contact tracing drive identifies Mr X as a potential case, having been exposed to a known case yesterday. Then, it can be that Mr X contracts the virus ten days from now, in which situation he will report negative if tested today or tomorrow, but will still amount to a spreading risk ten days later if he is at large then. This also means that secondary contact tracing, i.e. finding Mr X's contacts, must go ahead irrespective of his test results. Indeed, the medical authorities are well aware of this loophole.

The US Chamber of Commerce has given out state-by-state reopening guides for small businesses which are mandated to be followed across the US. Continued following of federal, state, tribal, territorial and local recommendations is of paramount importance. Prior to resuming work, all workplaces should have a carefully chartered exposure control, mitigation and recovery plan. Although essential guidance is specific to each business, there are certain measures that can be generally adopted across all workplaces.

1) Reopening in phases – The US government has laid down guidelines to open the country in 3 phases. The first phase involves the continuation of vulnerable individuals remaining at home. When in public, people are expected to wear masks, maintain maximum physical separation, avoid places with more than 10 people and limit non-essential travel. The second phase allows gatherings of 50 people, some non-essential travel and the reopening of schools. The third phase involves relaxation of restrictions, permitting vulnerable populations to operate.

2) Defining new metrics – The post-corona world will witness some significant changes in regulatory controls, and behavioral drift in personal and professional spheres. Cleanliness standards, safety standards, and infection prevention practices with regular monitoring and inspection for their assurance are some of the new terms that will have to be a part of the daily life of the people for at least the next few months.

3) Organizational changes – To help essential operations to function, companies and organizations will have to be prepared with advanced IT systems (in case of continuation of remote working), supply of PPE, travel facilities set up to avoid public transport, and provision of behavioral health services, leaving no stone unturned in overcoming biological, physical, and emotional challenges.

We can see that the above guidelines are broadly conformal to our model predictions.

4 METHODS OF CONTACT TRACING

As we have already mentioned, contact tracing is probably the single most important factor in determining the progression of COVID-19 in a region. We can see from the model that the faster the contact tracing takes place, the better; the more delay we have, the higher R becomes. Moreover, our model does not account for backward contact tracing. In practice however, a sufficiently high level of detection might not be possible to achieve with forward contact tracing alone. As important as it is, contact tracing is also one of the trickiest aspects to handle since it can interfere with people's privacy. In classical contact tracing, human tracers talk to the confirmed cases and track down their movements as well as the persons they interacted with over the past couple of days. This method has worked well in Ithaca, USA and in Kerala, India. While it is the least invasive of privacy, it is also the most unreliable, since people might not remember their movements or their interactions correctly. The time taken by this method is also the maximum. A more sophisticated variant supplements human testimony with CCTV footage and credit/debit card transaction histories – this approach is possible only in countries such as the USA where card usage predominates over cash. The most sophisticated contact tracing algorithms use artificial intelligence together with location-tracking mobile devices and apps – while they are quick and fool-proof, they automatically raise issues of privacy and security. For example, the TraceTogether app in Singapore, which worked very well during the initial phases of the outbreak, has not found popularity with many users [10]. Similarly, India's Aarogya Setu has also raised privacy concerns [11]. Americans too have expressed their aversion to using contact tracing apps in a recent poll, with only 43 percent of people saying that they trusted companies like Google or Apple with their data.

5 ENSURING SOCIAL COMPLIANCE – A BEHAVIORAL PERSPECTIVE

As the epidemic drags on and on, the continued restrictions on social activity are becoming more and more unbearable. There is an increasing tendency, especially among younger people who are much less at risk of serious symptoms, to violate the restrictions and spread the disease through irresponsible actions. However, as City F shows, a rise in violator behavior can completely nullify the effects of lockdown over the past few weeks or months. Here we discuss how public health professionals and policy makers can resort to behavioral/psychological theories to ensure compliance among the common people. The most widely used model is the Health Belief Model, which has been applied successfully to public health challenges. We briefly discuss the utility of this model in the current situation.

The Health Belief Model is a theoretical model which hypothesizes that interventions will be most effective if they target key factors that influence health behaviors, such as perceived susceptibility, perceived severity, perceived benefits, perceived barriers to action, exposure to factors that prompt action, and self-efficacy. In general, this model can be used to design short and long term interventions. The prime components of this model which are relevant in the current scenario can be outlined as follows.

1) Conducting a health need assessment to determine the target population – The best example is the demarcation of zones in India depending on the level of risk. A red zone is highest risk, an orange zone is average risk, and a green zone translates into no cases in the last 21 days.

2) Communicating the consequences involved with risky behaviors in a transparent manner – Central and state ministers as well as public health authorities are in constant communication with the masses.

3) Conveying information about the steps involved in performing the recommended action and focusing on the benefits of action – Famous celebrities, in addition to state and central governments, spread the messages explaining the required steps cogently and ensuring that they have the maximum reach, especially among social media-addicted millennials and similar populations.

4) Being open about the issues/barriers, identifying them at an early stage and working toward resolution – Activating all sorts of helpline numbers, email addresses, personal offices etc. to address any grievances around the topic.

5) Developing skills and providing assistance that encourage self-efficacy and the possibility of positive behavior change – Adequate arrangements for people from lower socio-economic strata, stable and trustworthy financial schemes for the middle class, plans to support small business, and a means to become a bridge between the affluent class and the needy class are some of the ways to foster positive behavior change and develop natural trust.

Other than the Health Belief Model, some theories that can be useful are:

Theory of Reasoned Action – This theory implies that an individual's behavior is based on the outcomes which the individual expects as a result of such behavior. In a practical scenario, if the health officials want the people to follow a particular trend, let us say based on our model, they need to reinforce the advantages of the targeted behavior and strategically address the barriers. For instance, to enforce separation minima even when they are apparently proving ineffective and the cases are increasing, they can use the examples of Cities B and C to convince the citizens that violations – and hence violators – can be responsible for thousands of excess deaths.

Trans-theoretical Model – This model posits that any health behavior change entails progress through six stages of change: precontemplation, contemplation, preparation, action, maintenance and termination. For instance, it was observed in March that, despite a rise in cases in New York City (NYC), people were not observing social restrictions the way they should have. Now, we can see that with passing time, the behavior of the masses transforms according to the stages of this model.

Precontemplation – This is the stage where people are typically not cognizant of the fact that their behavior is troublesome and may cause undesirable consequences. There is a long way to go before an actual behavior change. This phase coincides with the commencement of cases in NYC.

Contemplation – Recognition of the behavior as problematic
Classification is multifactorial, taking begins to surface and a shift begins towards behavior change. into account the incidence of cases, the doubling rate and the When the cases started being reported all over media and the limit of testing and surveillance feedback to classify the districts. major cause of spread began to surface, citizens started paying attention to their activities. KDD KiML 2020 Sharma et al. Preparation – People start taking small steps toward on their course of action. Since the virus is a new one, there is no behavior change like in our case, exhibiting hygienic practices precedent which can act as a model. Even among emerging and ensuring six feet separation minima. infectious diseases, this latest one is particularly unpredictable, since minuscule changes in parameters can cause dramatic Action – This stage covers the phase where people have just changes in the system’s behavior. This phenomenon is best changed their behavior and have positive intention to maintain illustrated by the notional cities, discussed previously. For that approach. In this instance, people continue to practice social example, to get from City A to B, all we did was increase by 50 restrictions and hygiene positively. percent the fraction of people who escaped the contact-tracers’ net. The result was a 30 times (not 30 percent!) increase in the Maintenance – This stage focuses on maintenance and total number of cases. Similarly, the difference between Cities B continuity toward the adopted approach. Majority of people in and D is an 11-day delay (recall that the first seven days in the NYC are exhibiting positive behavior and maintaining it plots are the seeding period, so they don’t count) in imposing the throughout the stages of reopening phases. This is vitally lockdown in D. 11 days out of a 200-plus-day run might not important to ensure that NYC stops at partial herd immunity like sound like a lot. 
But, that was enough to create tens of thousands City D instead of blowing up again like City C. of additional cases, risk overstressing healthcare systems and at Termination – There is lack of motivation to come back to the same time shorten the epidemic duration by a factor of three. the unhealthy behaviors and some sections of people across the Further uncertainty comes from the fact that the parameter country/world will continue practicing good hygiene (though not values are changing constantly. It is a well- known fact the social restrictions!) in our day-to-day lives. reported fraction of asymptomatic carriers has increased Social Ecological theory – This theory highlights multiple continuously over the last three months or so. Considering the levels of influences that molds the decision. In our case, let us sensitivity of this or any other model to parameter values, such say for example that the decision is to maintain sufficient changes can completely invalidate the results of a model as well physical separation once offices are opened up. To successfully as any decision which was made on their basis. Identifying follow this, there is a complex interplay between individual, potential exposures is much easier in a smaller city than a large relationship, community and societal factors that comes into or densely populated one. It is also more effective if the cases are action. Law enforcement authorities need to take this into mostly from the sophisticated social class who can use mobile consideration. A group of individuals when motivated by one phone contact tracing apps or otherwise keep (at least mental) another to follow the guidelines, builds a good connection within records of their movements and of the people they interacted the society, and in turn there is a high probability to build a with. However, if there is an outbreak among the unsophisticated healthy network within a defined area. 
A negative interplay at class, then even the most skillful contact tracer might run up different levels of motivation may in turn, prove disastrous and against a wall of zero or false information. In such cases there are cause all efforts go down the drain. A perfect illustration of this limited options that are left to the authorities to proceed in a in the present condition is how various NGO’s are working in conducive manner. conjunction with public health authorities to bring about a change at an individual level by door-to-door campaigning. This propels the behavior of even the most potentially recalcitrant population India went into lockdown on 25 March 2020. At that time, the in the most desirable way i.e. wearing masks and gloves, official figures stated that there were only 571 cases, which made adopting hand hygiene, being cognizant of symptoms arising in the decision appear premature to many people. Indeed, a seven- any member of the family and following quarantine rules in case day delay of lockdown was suggested so that the migrant workers of travel from other states. would have been able to return to their homes. However, when the lockdown was imposed, the testing had also been woefully 6 SOCIAL ATTITUDES AND BEHAVIOUR inadequate, with a nationwide total of just 22,694 tests having In this Section we address another important issue related to been conducted up to that date. If we use the extrapolation the Coronavirus. This is that the widely heterogeneous case technique of inferring case counts from death counts, then using profiles in different regions have often led to “corona contests” the same 1 percent mortality rate and 20 day interval to death, we among these regions. Far too often, the residents of better-off find almost 40,000 assumed cases on the day that the lockdown regions are seen heaping scorn on worse-hit regions. We have began. 
If we go by this figure, then the lockdown wasn’t really selected a tiny handful of representative media articles, early, and possibly should have been enforced earlier still in castigating the approaches of India, USA and Sweden, to show trouble zones such as Mumbai. Certainly, if the figure of 40,000 the breadth and vitriol of such commentary [12][13][14] cases is true, then one further week of normal life (with huge [15][16].A feature common to almost all opinion pieces like this crowds in trains and railway stations) might have been is that their authors do not have the slightest knowledge of the disastrous. From the vantage point of today, alternate issues involved, either epidemiological or economic. arrangements should definitely have been made much earlier for rehabilitation of the migrant workers. However these Before embarking on criticisms, we should note that policy arrangements would have involved considerable complexity in decisions need to be taken in real time, as the situation evolves. the prevailing situation, and were certainly not as easy as one The authorities do NOT have the benefit of hindsight to decide KDD KiML 2020 Sharma et al. week’s delay in announcing lockdown. Sweden, which has • Efficiency of contact tracing comes at the expense of adopted a controlled herd immunity strategy, has been accused people’s privacy – balancing between the two is a delicate of playing with fire. It is also possible that the Swedish optimization problem. authorities are aware that they do not have the contact tracing capacity required for performing like City A and hence are • In some regions, restrictions such as masks and six-feet attempting something like City D – a faster end of the epidemic separation minima must be maintained for a very long time to than City B at the expense of a higher case count. To make a come. 
The public health authorities can ensure compliance by comprehensive analysis of their policy, it is crucial to know not resorting to socio –behavioral theories/approaches. only the last intricate detail of the epidemiological aspects but In deploying advanced contact tracing techniques, also the details of the economic considerations. That is almost significant consideration has to be given for ensuring high impossible. On a different note however, we have seen reports data security and lay down privacy regulations that are [17], [18] stating that the virus has entered into old age homes convincing to the users and similar establishments, causing hundreds of deaths over there. Assuming that these reports are not overturned in the Control the spread by swift identification and course of time, allowing the ingress of virus into high-risk areas isolation of cases accompanied by tracing and quarantine for is an indefensible action, whatever the overall epidemiological at least 2 weeks strategy. Empowering of individuals and communities by the government to facilitate efficient capacity building. Finally, extremely important public health factors such as the racial dependence of susceptibility and/or transmissibility have just Multidisciplinary coordination, strong leadership to started coming to the surface. Another complete grey area is the mobilize communities and take quick decisions coupled with mutations which this new and vicious virus are undergoing and what thoughtful development of operation plans are likely to prove effect they might have on the spreading dynamics. Some reports also considerably efficient in handling this pandemic to the best of reflect that the change in genetic composition due to mutation might our capacity. be the reason behind huge differences in the crude infection rate between countries [19][20]. In the absence of a clear picture about References this, any public health measure is all the more likely to be a random [1] B. Shayak and M. 
M. Sharma, “Retarded guess with non-zero probabilities of both success and failure. Not logistic equation as a universal dynamic model for the everything about corona is random or outside one’s control though. spread of COVID-19,” medRxiv, p. Amongst the European countries, we can see that Germany, Austria, 2020.06.09.20126573, 2020, doi: Switzerland, Denmark, Norway and Finland have definitely 10.1101/2020.06.09.20126573. managed the epidemic while their neighbors have not, which rules out some hidden luck factor. The same has happened in Kerala and [2] E. Okanyene, B. Rader, Y. L. Barnoon, L. Karnataka (also in India). This has been feasible only due to Goodwin, and J. S. Brownstein, “Analysis of hospital governmental awareness and hard work, and people’s cooperation. traffic and search engine data in Wuhan China indicates early disease activity in the Fall of 2019,” Similarly, there are some governments which have been clearly Harvard, 2020, [Online]. Available: guilty of negligence or hubris in their management of the disease. It http://nrs.harvard.edu/urn-3:HUL.InstRepos:42669767. would also be noteworthy to observe and take lessons from the some of the new places like Alabama, Arkansas, Florida , Texas etc which [3] CDC, “Forecasting COVID-19 in the US,” have been recently identified as potential hotspots of this pandemic. 2020. https://www.cdc.gov/coronavirus/2019- Lastly, our conclusion best resonates with the message that ncov/covid-data/forecasting-us.html. coronavirus is not some kind of race but a public health disaster and [4] “Microsoft coronavirus webpage.” we should adopt a unified approach to the fight against it. https://www.bing.com/covid. CONCLUSION [5] “COVID-19 in India.” [Internet]. Available Here, we summarize the take-home messages from this paper: from: https://www.covid19india.org/. • A city can reopen only if it is past the peak of cases. [6] L. Star and S. Moghadas, “The Role of Reopening must be accompanied by robust contact tracing. 
The Mathematical Modelling in Public Health Planning and US CDC has laid down a set of reopening guidelines which are Decision Making,” Natl. Collab. Cent. Infect. Dis., vol. (5)2, no. 2, pp. 285–299, 2010. compatible with our model and its solutions. [7] Livemint, ““Many states are far short of • Incorporation of socio-behavioral theories can come COVID-19 testing levels.” into play for effective execution of interventional strategies. https://www.statnews.com/2020/04/27/coronavirus- many-states-short-of-testing-levels-needed-for- safereopening/. KDD KiML 2020 Sharma et al. [8] Harvard Business Review, “A Plan to no-longer-exists-provokes-controversy.html. Safely Reopen the U.S. Despite Inadequate Testing.” https://hbr.org/2020/05/a-plan-to-safely-reopen-the-u- s-despite-inadequate-testing. [9] S. Telles, S. K. Reddy, and H. R. Nagendra, “Variation in False Negative Rate of RT-PCR Based SARS-CoV-2 Tests by Time Since Exposure,” J. Chem. Inf. Model., vol. 53, no. 9, pp. 1689–1699, 2019, doi: 10.1017/CBO9781107415324.004. [10] M. Lee, “Given low adoption rate of TraceTogether, experts suggest merging with SafeEntry or other apps,” Today, 2020. https://www.todayonline.com/singapore/given-low- adoption-rate-tracetogether-experts-suggest-merging- safeentry-or-other-apps. [11] A. Zargar, “Privacy, security concerns as India forces virus-tracking app on millions,” CBS News. . [12] K. Bajpai, “Five lessons of COVID.” Available: from: https://timesofindia.indiatimes.com/blogs/toi- editpage/five-lessons-of-covid-factors-that-are- negative-for-india-are-having-greater-impact-than- mitigating-ones/.. [13] K. Grimes, “Is politics the reason why Gov. Newsom is keeping California locked down ?,” California Globe. . [14] R.Guha, “What Modi got wrong on COVID-19 and how he can fix it.” https://www.ndtv.com/opinion/5-lessons-for-modi-on- covid-19-by-ramachandra-guha-2227259. [15] K. Weintraub, “Sweden sticks with controverial covid approach.,” [Online]. 
Available: https://www.webmd.com/lung/news/20200501/sweden-sticks-with-controversial-covid19-approach.

[16] The Island Now, "Cuomo has failed in his handling of coronavirus." https://theislandnow.com/opinions-100/readers-write-cuomo-has-failed-in-handling-of-coronavirus/.

[17] "Are care homes the dark side of Sweden's coronavirus strategy?" https://www.euronews.com/2020/05/19/are-care-homes-the-dark-side-of-sweden-s-coronavirus-strategy.

[18] "What's going wrong in Sweden's care homes?"

[19] L. van Dorp et al., "Emergence of genomic diversity and recurrent mutations in SARS-CoV-2," Infect. Genet. Evol., vol. 83, p. 104351, 2020, doi: 10.1016/j.meegid.2020.104351.

[20] H. Ellyatt, "Coronavirus no longer exists clinically - controversy," CNBC. https://www.cnbc.com/2020/06/02/claim-coronavirus-no-longer-exists-provokes-controversy.html.

Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets

Jitin Krishnan (Department of Computer Science, George Mason University, Fairfax, VA; jkrishn2@gmu.edu), Hemant Purohit (Department of Information Sciences & Technology, George Mason University, Fairfax, VA; hpurohit@gmu.edu), Huzefa Rangwala (Department of Computer Science, George Mason University, Fairfax, VA; rangwala@gmu.edu)

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). KiML'20, August 24, 2020, San Diego, California, USA. © 2020 Copyright held by the author(s). https://doi.org/10.1145/nnnnnnn.nnnnnnn

ABSTRACT

State-of-the-art models for cross-lingual language understanding such as XLM-R [7] have shown great performance on benchmark data sets. However, they typically require some fine-tuning or customization to adapt to downstream NLP tasks for a domain. In this work, we study the unsupervised cross-lingual text classification task in the context of the crisis domain, where rapidly filtering relevant data regardless of language is critical to improve the situational awareness of emergency services. Specifically, we address two research questions: a) Can a custom neural network model over XLM-R, trained only in English for such a classification task, transfer knowledge to multilingual data and vice-versa? b) By employing an attention mechanism, does the model attend to words relevant to the task regardless of the language? To this goal, we present an attention realignment mechanism that utilizes a parallel language classifier to minimize any linguistic differences between the source and target languages. Additionally, we pseudo-label the tweets from the target language, which are then augmented with the tweets in the source language for retraining the model. We conduct experiments using Twitter posts (tweets) labelled as a 'request' in the open source data set by Appen¹, consisting of multilingual tweets for crisis response. Experimental results show that attention realignment and pseudo-labelling improve the performance of unsupervised cross-lingual classification. We also present an interpretability analysis by evaluating the performance of attention layers on original versus translated messages.

KEYWORDS: Social Media, Crisis Management, Text Classification, Unsupervised Cross-Lingual Adaptation, Interpretability

ACM Reference Format: Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20), 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Social media platforms such as Twitter provide valuable information to aid emergency response organizations in gaining real-time situational awareness during the sudden onset of crisis situations [4]. Extracting critical information about affected individuals, infrastructure damage, medical emergencies, or food and shelter needs can help emergency managers make time-critical decisions and allocate resources efficiently [15, 21, 22, 30, 31, 36]. Researchers have designed numerous classification models to help towards this humanitarian goal of converting real-time social media streams into actionable knowledge [1, 22, 26, 28, 29]. Recently, with the advent of multilingual models such as multilingual BERT [9] and XLM [20], researchers have started adopting them for multilingual disaster tweets [6, 25]. Since XLM-R [7] has been shown to be the most superior model in cross-lingual language understanding, we restrict our work to this model to explore the aspects of cross-lingual transfer of knowledge and interpretability.

Figure 1: Problem: Unsupervised cross-lingual tweet classification, e.g., train a model using English tweets, predict labels for multilingual tweets, and vice-versa.

In this work, we address two questions. The first is to examine whether XLM-R is effective in capturing multilingual knowledge, by constructing a custom model over it to analyze if a model trained using English-only tweets will generalize to multilingual data and vice-versa. Social media streams are generally different from other text, given the user-generated content. For example, tweets are usually short, with possible errors and ambiguity in the behavioral expressions. These properties in turn make the classification task, or extracting representations, a bit more challenging. The second question is to examine whether word translations will be equally attended by the attention layers. For instance, the words with higher attention weights in a sentence in Haitian Creole such as "Tanpri nou bezwen tant avek dlo nou zon silo mesi" should align with the words in its corresponding translated tweet in English, "Please, we need tents and water. We are in Silo, Thank you!". Our core idea is that if 'dlo' in the Haitian tweet has a higher weight, so should its English translation 'water'. This word-level language-agnostic property can promote machine learning models to be more interpretable. It also brings several benefits to downstream tasks such as knowledge graph construction using keywords extracted from tweets. In situations where data is available only in one language, this similarity in attention would still allow us to extract relevant phrases in cross-lingual settings. To the best of our knowledge, in the crisis analytics domain, aligning attention in a cross-lingual setting has not been attempted before. In this work, we restrict our classification experiments to tweets containing 'request' intent, which will be expanded to other behaviors, tasks, and datasets in the future.

Contributions: We propose a novel attention realignment method which promotes the task classifier to be more language agnostic, which in turn tests the effectiveness of the multilingual knowledge capture of the XLM-R model for crisis tweets; and a pseudo-labelling procedure to further enhance the model's generalizability. Further, incorporating the attention-based mechanism allows us to perform an interpretability analysis on the model, by comparing how words are attended in the original versus translated tweets.

¹ https://appen.com/datasets/combined-disaster-response-data/

2 RELATED WORK AND BACKGROUND

There are numerous prior works (c.f. surveys [4, 14]) that focus specifically on disaster-related data to perform classification and other rapid assessments during the onset of a new disaster event. A crisis period is an important but challenging situation, where collecting labeled data during an ongoing event is very expensive.

With more and more machine learning systems being adopted by diverse application domains, transparency in decision-making inevitably becomes an essential criterion, especially in high-risk scenarios [12] where trust is of utmost importance. With deep neural networks, including natural language systems, shown to be easily fooled [16], there have been many promising ideas that empower machine learning systems with the ability to explain their predictions [5, 32]. Gilpin et al. [11] present a survey of interpretability in machine learning, which provides a taxonomy of research that addresses various aspects of this problem. Similar to the work by Ross et al. [33], we employ an attention-based approach to evaluate model interpretability applied to the crisis domain.

3 METHODOLOGY

3.1 Problem Statement: Unsupervised Cross-Lingual Crisis Tweet Classification

Consider tweets in language A and their corresponding translated tweets in language B. The task of unsupervised cross-lingual classification is to train a classifier using data only from the source language and predict the labels for the data in the target language. This experimental setup is usually represented as A → B for training a model using A and testing on B, or B → A for training a model using B and testing on A. X refers to the data and Y refers to the ground-truth labels. The multilingual dataset used in our experiments consists of original multilingual (ml) tweets and their translated (en) tweets in English. To summarize:

Experiment A (en → ml):
Input: X_en, Y_en, X_ml
Output: Y_ml^pred ← predict(X_ml)

Experiment B (ml → en):
Input: X_ml, Y_ml, X_en
Output: Y_en^pred ← predict(X_en)
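The protocol above can be sketched end to end. The following is a minimal illustration, not the authors' code: a nearest-centroid classifier stands in for the XLM-R-based model, and random vectors stand in for tweet embeddings; all function names are ours.

```python
import numpy as np

def train_centroids(X_src, y_src):
    """Fit one centroid per class on source-language embeddings only."""
    classes = np.unique(y_src)
    return {c: X_src[y_src == c].mean(axis=0) for c in classes}

def predict(centroids, X_tgt):
    """Label target-language embeddings by the nearest class centroid."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X_tgt - centroids[c], axis=1)
                      for c in classes], axis=1)
    return np.array([classes[i] for i in dists.argmin(axis=1)])

# Experiment A (en -> ml): train on English data, predict on multilingual
# data; no target labels are ever used, keeping the setup unsupervised.
rng = np.random.default_rng(0)
X_en = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(3, 1, (20, 8))])
y_en = np.array([0] * 20 + [1] * 20)
X_ml = np.vstack([rng.normal(0, 1, (5, 8)), rng.normal(3, 1, (5, 8))])

centroids = train_centroids(X_en, y_en)
y_ml_pred = predict(centroids, X_ml)
```

Swapping the toy classifier for the XLM-R + BiLSTM + attention model described later leaves the protocol itself unchanged.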
This problem led to several works on domain adaptation techniques, in which machine learning models can learn and generalize to unseen crisis events [3, 10, 18, 23]. In the context of crisis data, Nguyen et al. [28] designed a convolutional neural network model which does not require any feature engineering, and Alam et al. [1] designed a CNN architecture with adversarial training on graph embeddings. Krishnan et al. [19] showed that sharing a common layer across multiple tasks can improve the performance of tasks with limited labels.

In the multilingual or cross-lingual direction, many works [8, 17] tried to align word embeddings (such as fastText [27]) from different languages into the same space, so that a word and its translations have the same vector. These models are superseded by models such as multilingual BERT [9] and XLM-R [7] that produce contextual embeddings, which can be pretrained using several languages together to achieve impressive performance gains on multilingual use-cases.

The attention mechanism [2, 24] is one of the most widely used methods in deep learning; it constructs a context vector by weighing the entire input sequence, improving over previous sequence-to-sequence models [13, 34, 35]. As the model produces weights associated with each word in a sentence, this allows for evaluating interpretability by comparing the words that are given priority in original versus translated tweets.

3.2 Overview

In the following sections, we propose two methodologies to enhance cross-lingual classification: 1) Attention Realignment and 2) Pseudo-Labelling. Attention realignment utilizes a language classifier which is trained in parallel to realign the attention layer of the task classifier, such that the weights are geared more towards task-specific words regardless of the language. Pseudo-labelling further enhances the classifier by adding high-quality seeds from the target language that are pseudo-labelled by the task classifier.

3.3 Attention Realignment by Parallel Language Classifier

As depicted in Fig. 2, the model on the left side is the task classifier and the model on the right side is a language classifier that is trained in parallel. The purpose of this language classifier is to pick up aspects that are missed by the XLM-R model. These could be tweet-specific, crisis-specific, or other linguistic nuances that can separate original tweets from translated tweets. Note that, semantically, translated words are expected to have similar XLM-R representations.

Figure 2: Attention Realignment with Pseudo-Labelling over XLM-R model

Table 1: Notations
en — tweets translated to English ('message' column in the dataset)
ml — multilingual tweets ('original' column in the dataset)
α — attention layer
T — a component that uses task-specific data, i.e., + and − 'request' tweets
L — a component that uses language-specific data, i.e., en and ml tweets
a_BiLSTM — activation from the BiLSTM layer
β, γ, ζ — hyperparameters

Attention realignment is a mechanism we introduce to promote the task classifier to be more language independent. The main idea is that words which are given higher attention by the language classifier should be less important to the task classifier. For example, 'dlo' in Haitian and 'water' in English should have the same vector representation in language-agnostic models, while the sentence structure, grammar, and other nuances can vary. We enforce this rule by constructing two operations:

(1) Attention Difference: When a sentence goes through model M1, it also goes through model M2. For the same sentence, this returns two attention weight vectors: one from the task classifier (α_T) and the other from the language classifier (α_T′). Directly subtracting α_T′ from α_T poses two issues: 1) we do not know whether they are comparable, and 2) α_T′ may have negative values. A simple solution is to normalize both vectors and clip α_T′ such that it lies between 0 and 1. Thus, the attention subtraction step is:

α_T ← α_T/‖α_T‖ − γ_T · clip(α_T′/‖α_T′‖, 0, 1)    (1)

where γ_T is a hyperparameter that tunes the amount of subtraction needed for the task classifier. Similarly, for the language classifier:

α_L′ ← α_L′/‖α_L′‖ − γ_L · clip(α_L/‖α_L‖, 0, 1)    (2)

(2) Attention Loss: Along with the attention difference, the model can also be trained by inserting an additional loss term that penalizes the similarity between the attention weights from the two classifiers. We use the Frobenius norm:

L_At = ‖α_Tᵀ α_T′‖²_F    (3)

L_Al = ‖α_Lᵀ α_L′‖²_F    (4)

for the task and language classifiers respectively. The resulting final loss function of the joint training is:

L(θ) = ζ_T·CE_T + β_T·L_At + ζ_L·CE_L + β_L·L_Al    (5)

where β is the hyperparameter that tunes the attention loss weight, ζ is the hyperparameter that tunes the joint training loss, and CE denotes the binary cross entropy loss (Eq. 6).

Table 3: Implementation Details
T_x: 30; Deep learning library: Keras; Optimizer: Adam [lr = 0.005, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01]; Maximum epochs: 100; Dropout: 0.2; Early stopping patience: 10; Batch size: 32; ζ_T = 1; ζ_L = 0.1; β_T, β_L, γ_T, γ_L = 0.01

We use the open source dataset from Appen³, consisting of multilingual crisis response tweets. The dataset statistics for tweets with 'request' behavior labels are shown in Table 2. For all the experiments, the dataset is balanced for each split.
    CE = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]    (6)

It is important to note that the Frobenius norm is not simply between the attention weights of the two models, but rather between the attention weights produced by the two models on the same input tweet. For example, for a given tweet, the task classifier attends more to task-specific words and the language classifier attends to language-specific words; the mechanism makes sure that they are distinct.

Each experiment is denoted as A → B, where A is the data used to train the model and B is the data used to test the model. For example, en → ml means we train the model using English tweets and test on multilingual tweets.

Models are implemented in Keras, and the details are shown in Table 3. Hyperparameters β_T, β_L, γ_T, and γ_L are not exhaustively tuned; we leave this exploration for future work.

Table 4: Performance Comparison (Accuracy in %) for Source → Target (Source → Source in brackets).
Baseline = XLM-R + BiLSTM + Attention; Model M1 = Baseline + Attention Realignment; Model M2 = Model M1 + Pseudo-Labelling.

                Baseline        Model M1        Model M2
    en → ml     59.98 (80.57)   62.53 (77.02)   66.79 (82.39)
    ml → en     60.93 (70.07)   65.69 (63.50)   70.95 (73.84)

3.4 Pseudo-Labelling
To enhance the model further, we pseudo-label the data in the target language. For example, if we are training a model using the English tweets, we use the original tweets before translation for pseudo-labelling. The idea is simply to gather high-quality seeds from the target to retrain the model. Note that we still do not use any target labels here, still following the unsupervised goal. Thus, for retraining model M1 for en → ml, the new dataset would consist of X⁺_en and X^pseudo⁺_ml as positive examples, and X⁻_en and X^pseudo⁻_ml as negative examples.
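The seed-selection step of pseudo-labelling can be sketched as follows. This is a hypothetical helper, not the released code; the 0.7 confidence threshold matches the one reported for Model M1's seed selection in Section 5:

```python
import numpy as np

def pseudo_label(probs, threshold=0.7):
    """Select high-confidence target-language seeds from a model's
    predicted positive-class probabilities. Returns (indices, labels).
    No gold target labels are used, keeping the setup unsupervised."""
    probs = np.asarray(probs)
    pos = np.where(probs > threshold)[0]        # confident positives
    neg = np.where(probs < 1.0 - threshold)[0]  # confident negatives
    idx = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos), dtype=int),
                             np.zeros(len(neg), dtype=int)])
    return idx, labels

# Toy predicted probabilities on target-language (ml) tweets.
p = [0.95, 0.40, 0.10, 0.80, 0.55]
idx, labels = pseudo_label(p, threshold=0.7)
# Tweets 0 and 3 become pseudo-positive seeds and tweet 2 a
# pseudo-negative seed; the uncertain ones (0.40, 0.55) are discarded.
```

The selected seeds are then appended to the source-language training set and the task classifier is retrained.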
3.5 XLM-R Usage
The recommended feature usage of XLM-R² is either fine-tuning to the task or aggregating features from all the 25 layers. We employ the latter to extract the multilingual embeddings for the tweets.

² https://github.com/facebookresearch/XLM

4 DATASET & EXPERIMENTAL SETUP

³ https://appen.com/datasets/combined-disaster-response-data/

Table 2: Dataset Statistics for both en and ml
                Train    Validation   Test
    Positive     3554           418    496
    Negative    17473          2152   2128

5 RESULTS & DISCUSSION
Table 4 shows the cross-lingual performance comparison of all the models. The three models are described below:

(1) Baseline: The baseline model consists of embeddings retrieved from XLM-R, trained over BiLSTM and attention layers. This is a traditional sequence (text) classifier enhanced with an attention mechanism. Activations from the BiLSTM layers are weighed by the attention layer to construct the context vector, which is then passed through a dense layer and softmax function to produce the classification output.

(2) Model M1: Adding attention realignment to the baseline model produces model M1. Attention realignment is achieved through a language classifier which is trained in parallel, with the goal of making the task classifier more language agnostic. The attention weights for both task and language classifiers are manipulated by each other during training by a process of subtraction (attention difference) as well as a loss component (attention loss). See Section 3.3.

Figure 3: Attention visualization example for 'request' tweets: words and their attention weights for two tweets in Haitian Creole and their translations in English (the darker the shade, the higher the attention).

The same-language (Source → Source) scores are shown in brackets in Table 4. A deeper investigation in this direction on various other tasks can shed more light on the impact of the realignment mechanism.
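Reading Table 4, the improvements reported in Section 5 appear to be relative gains over the baseline accuracy rather than absolute percentage-point differences; a few lines reproduce them (to within rounding) from the table's cross-lingual columns:

```python
# Accuracy values copied from Table 4 (cross-lingual setting).
table4 = {
    "en->ml": {"Baseline": 59.98, "M1": 62.53, "M2": 66.79},
    "ml->en": {"Baseline": 60.93, "M1": 65.69, "M2": 70.95},
}

def relative_gain(model_acc, baseline_acc):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (model_acc - baseline_acc) / baseline_acc

gains = {
    setup: {m: round(relative_gain(accs[m], accs["Baseline"]), 1)
            for m in ("M1", "M2")}
    for setup, accs in table4.items()
}
# e.g. en->ml: M1 is +4.3% and M2 is +11.4% relative to the baseline.
```

This matches the reported +4.3%, +11.4%, and +7.8% figures exactly, and the +16.5% figure to within 0.1.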
(3) Model M2: Adding the pseudo-labelling procedure to model M1 produces model M2. Using model M1, which is trained to be language agnostic, tweets from the target languages are pseudo-labelled. High-quality seeds are selected (using Model M1, p > 0.7) and augmented to the original training dataset to retrain the task classifier.

Results show that, for cross-lingual evaluation on en → ml, model M1 outperforms the baseline by +4.3% and model M2 outperforms it by +11.4%. On ml → en, model M1 outperforms the baseline by +7.8% and model M2 outperforms it by +16.5%. This shows that both models are effective in cross-lingual crisis tweet classification. An interesting observation is that using attention realignment alone decreased the classification performance in the same language, which is brought back up by pseudo-labelling.

5.1 Interpretability: Attention Visualization
We follow an attention architecture similar to the one shown in [18]. The context vector is constructed as the dot product between the attention weights and the word activations. This represents the interpretable layer in our architecture, as the attention weights represent the importance of each word in the classification process. Two examples are shown in Figure 3. In the first example, both en → en and ml → ml give attention to the word 'hungry' (i.e., 'grangou' in Haitian Creole). Note that these two are results from models trained in the same language in which they are tested, so ideal performance is expected. For the baseline model in the cross-lingual set-up en → ml, although it correctly predicts the label, the attention weights are more spread apart. In model M2, with attention realignment and pseudo-labelling, the attention weights are shifted more toward 'grangou', although with some spread.
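The interpretable layer described above — softmax-normalized attention weights over words, combined with the word activations to form the context vector — can be sketched in NumPy. The random activations below are stand-ins; the actual model uses the Keras stack of Table 3 with BiLSTM activations:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h, w):
    """Given per-word activations h (T_x x d) and an attention scoring
    vector w (d,), return the word weights alpha and the context vector
    (the attention-weighted combination of the activations)."""
    scores = h @ w           # one scalar score per word
    alpha = softmax(scores)  # attention weights, sum to 1
    context = alpha @ h      # weighted sum of word activations
    return alpha, context

rng = np.random.default_rng(0)
T_x, d = 30, 8                    # T_x = 30 words, as in Table 3
h = rng.normal(size=(T_x, d))     # stand-in for BiLSTM activations
w = rng.normal(size=d)            # stand-in attention parameters
alpha, context = attention_context(h, w)
```

Because alpha sums to one, its entries can be read directly as per-word importances, which is what the shading in Figure 3 visualizes.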
Similarly, in example 2, the attention weights in the baseline model are more spread apart, while the cross-lingual performance of model M2 aligns more with en → en and ml → ml. These examples show the importance of having interpretability as a key criterion in cross-lingual crisis tweet classification problems; it can also be used for downstream tasks such as extracting relevant keywords for knowledge graph construction.

6 CONCLUSION
We presented a novel approach to the unsupervised cross-lingual crisis tweet classification problem, using a combination of an attention realignment mechanism and a pseudo-labelling procedure (over the state-of-the-art multilingual model XLM-R) to promote the task classifier to be more language agnostic. Performance evaluation showed that models M1 and M2 outperformed the baseline by +4.3% and +11.4% respectively for cross-lingual text classification from English to multilingual. We also presented an interpretability analysis by comparing the attention layers of the models; it shows the importance of incorporating a word-level language-agnostic characteristic into the learning process when training data is available only in one language. Performing extensive hyperparameter tuning and expanding the idea to other tasks (including cross-task/multi-task) are left as future work. Another direction for future work is to incorporate human-engineered knowledge from multilingual knowledge graphs such as BabelNet into our model architecture, which could improve the learning of similar concepts across languages that are critical to crisis response agencies.

Reproducibility: Source code is available at: https://github.com/jitinkrishnan/Cross-Lingual-Crisis-Tweet-Classification

7 ACKNOWLEDGEMENT
The authors would like to thank U.S. National Science Foundation grants IIS-1815459 and IIS-1657379 for partially supporting this research.

REFERENCES
[1] Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151 (2018).
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 120–128.
[4] Carlos Castillo. 2016. Big crisis data: social media in disasters and time-critical situations. Cambridge University Press.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. 2172–2180.
[6] Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020. Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 292–298.
[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[8] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word Translation Without Parallel Data. arXiv preprint arXiv:1710.04087 (2017).
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014).
[11] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 80–89.
[12] David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017).
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR) 47, 4 (2015), 1–38.
[15] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894 (2016).
[16] Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328 (2017).
[17] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745 (2018).
[18] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Diversity-Based Generalization for Neural Unsupervised Text Classification under Domain Shift. https://arxiv.org/pdf/2002.10937.pdf (2020).
[19] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Social Web Data for Emergency Services. arXiv preprint arXiv:2003.04991 (2020).
[20] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019).
[21] Kathy Lee, Ankit Agrawal, and Alok Choudhary. 2013. Real-time disease surveillance using twitter data: demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1474–1477.
[22] Hongmin Li, Doina Caragea, Cornelia Caragea, and Nic Herndon. 2018. Disaster response aided by tweet classification with a domain adaptation approach. Journal of Contingencies and Crisis Management 26, 1 (2018), 16–27.
[23] Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. 2018. Hierarchical attention transfer network for cross-domain sentiment classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[25] Guoqin Ma. 2019. Tweets Classification with BERT in the Field of Disaster Management. https://pdfs.semanticscholar.org/d226/185fa1e14118d746cf0b04dc5be8f545ec24.pdf.
[26] Reza Mazloom, Hongmin Li, Doina Caragea, Cornelia Caragea, and Muhammad Imran. 2019. A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets. International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 11, 2 (2019), 1–19.
[27] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[28] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902 (2016).
[29] Ferda Ofli, Patrick Meier, Muhammad Imran, Carlos Castillo, Devis Tuia, Nicolas Rey, Julien Briant, Pauline Millet, Friedrich Reinhard, Matthew Parkan, et al. 2016. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big Data 4, 1 (2016), 47–59.
[30] Bahman Pedrood and Hemant Purohit. 2018. Mining help intent on twitter during disasters via transfer learning with sparse coding. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 141–153.
[31] Hemant Purohit, Carlos Castillo, Fernando Diaz, Amit Sheth, and Patrick Meier. 2013. Emergency-relief coordination on social media: Automatically matching resource requests and offers. First Monday 19, 1 (Dec. 2013).
[32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[33] Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717 (2017).
[34] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[36] István Varga, Motoki Sano, Kentaro Torisawa, Chikara Hashimoto, Kiyonori Ohtake, Takao Kawai, Jong-Hoon Oh, and Stijn De Saeger. 2013. Aid is out there: Looking for help from tweets during a large scale disaster. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1619–1629.