<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>San Diego,
California, USA, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>KiML 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Editors: Manas Gaur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Jaimes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatma Özcan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Srinivasan Parthasarathy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Organizers: Manas Gaur (AI Institute, University of South Carolina); Alejandro (Alex) Jaimes (Dataminr Inc., NYC); Fatma Özcan (IBM Research Almaden); Srinivasan Parthasarathy (Ohio State University); Sameena Shah (JP Morgan, NYC); Amit Sheth (AI Institute, University of South Carolina); Biplav Srivastava (IBM Chief Analytics Office)</institution>
          ,
          <addr-line>NYC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Webmaster: Vishal Pallagani (AI Institute, University of South Carolina); Ibrahim Salman (AI Institute, University of South Carolina)</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>24</volume>
      <issue>2020</issue>
      <fpage>1</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>First International Workshop on Advancing Decision</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>August
San Diego, CA
Co-located with
26th ACM Conference on
Knowledge Discovery and Data Mining
KDD 2020, San Diego, California
http://kiml2020.aiisc.ai/</p>
    </sec>
    <sec id="sec-2">
      <title>Proceedings of the</title>
    </sec>
    <sec id="sec-3">
      <title>ACM SIGKDD Workshop on Knowledge-infused Mining and Learning (KiML)</title>
      <p>Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee.</p>
      <p>These proceedings are not included in the ACM Digital Library.</p>
      <sec id="sec-3-1">
        <title>KiML’20, August 24, 2020, San Diego, California, USA.</title>
        <p>Copyright (c) 2020 held by the author(s). In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the ACM SIGKDD 2020
Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>ACM SIGKDD Workshop on Knowledge-infused Mining and Learning (KiML)</title>
      <p>Program Committee:</p>
      <sec id="sec-4-1">
        <title>Nitin Agarwal (University of Arkansas)</title>
      </sec>
      <sec id="sec-4-2">
        <title>Amanuel Alambo (Kno.e.sis Center)</title>
      </sec>
      <sec id="sec-4-3">
        <title>Shreyansh Bhatt (Amazon)</title>
      </sec>
      <sec id="sec-4-4">
        <title>Vasilis Efthymiou (IBM Research)</title>
      </sec>
      <sec id="sec-4-5">
        <title>Utkarshani Jaimini (AI Institute, University of South Carolina)</title>
        <p>Ugur Kurşuncu (AI Institute, University of South Carolina)</p>
      </sec>
      <sec id="sec-4-6">
        <title>Sarasi Lalithsena (IBM Watson)</title>
      </sec>
      <sec id="sec-4-7">
        <title>Chuan Lei (IBM Research)</title>
      </sec>
      <sec id="sec-4-8">
        <title>Quanzhi Li (Alibaba Group)</title>
      </sec>
      <sec id="sec-4-9">
        <title>Xiaomo Liu (S&amp;P Global Ratings)</title>
      </sec>
      <sec id="sec-4-10">
        <title>Yong Liu (Outreach.io)</title>
      </sec>
      <sec id="sec-4-11">
        <title>Raghava Mutharaju (IIIT Delhi)</title>
      </sec>
      <sec id="sec-4-12">
        <title>Arindam Pal (Data61, CSIRO)</title>
      </sec>
      <sec id="sec-4-13">
        <title>Sujan Perera (Amazon)</title>
      </sec>
      <sec id="sec-4-14">
        <title>Hemant Purohit (George Mason University)</title>
      </sec>
      <sec id="sec-4-15">
        <title>Kaushik Roy (AI Institute, University of South Carolina)</title>
      </sec>
      <sec id="sec-4-16">
        <title>Valerie Shalin (Wright State University)</title>
      </sec>
      <sec id="sec-4-17">
        <title>Kai Shu (Arizona State University)</title>
      </sec>
      <sec id="sec-4-18">
        <title>Nikhita Vedula (Ohio State University)</title>
        <p>Ruwan Wickramarachchi (AI Institute, University of South Carolina)</p>
      </sec>
      <sec id="sec-4-19">
        <title>Ke Zhang (Dataminr Inc.)</title>
      </sec>
      <sec id="sec-4-20">
        <title>Jinjin Zhao (Amazon)</title>
        <p>Research in artificial intelligence and data science is accelerating rapidly due to an unprecedented
explosion in the amount of information on the web. In parallel, we have seen immense growth in the
construction and utility of knowledge networks from Google, Netflix, NSF, and NIH. However, current
methods risk an unsatisfactory ceiling of applicability due to shortcomings in bringing homogeneity
between knowledge graphs, data mining, and deep learning. In this changing world, retrospective studies
for building state-of-the-art AI and data science systems have raised concerns about trust, traceability, and
interactivity for prospective applications in healthcare, finance, and crisis response. We believe the
paradigm of knowledge-infused mining and learning would account for both pieces of knowledge that
accrue from domain expertise and guidance from physical models. Further, it will allow the community to
design new evaluation strategies that assess robustness and fairness across all comparable
state-of-the-art algorithms.</p>
        <p>The Workshop on Knowledge-infused Mining and Learning for Social Impact was centered around the
following thematic components: (a) Data Management: includes resource management, resource
discovery across heterogeneous and inconsistent data resources. (b) Data Usage: includes methods and
systems for visualization, representations, reasoning, and interaction. (c) Evaluation: brings together
researchers working at the intersection of databases, semantic web, information systems, and AI to
create new approaches and tools to benefit a broad range of policymakers (e.g., mental health professionals,
education practitioners, emergency responders, and economists).</p>
        <p>The workshop will bring together researchers and practitioners from both academia and industry who
are interested in the creation and use of knowledge graphs in understanding online conversations on
crisis response (e.g., COVID-19), public health (e.g., social network analysis for mental health insights),
and finance (e.g., mining insights on the financial impact (recession, unemployment) of COVID-19 using
Twitter or organizational data). Additionally, we encourage participation from researchers and practitioners in the areas
of human-centered computing, interaction and reasoning, statistical relational mining and learning,
intelligent agent systems, semantic social network analysis, deep graph learning, and recommendation
systems.</p>
        <p>The main program of KiML’20 consists of seven papers, selected out of thirteen submissions, covering
topics related to knowledge-enabled feature elicitation, adversarial learning, crisis response, public
health, and COVID-19. We sincerely thank the authors of the submissions as well as the attendees of the
workshop. We wish to thank the members of our program committee for their help in selecting
high-quality papers. Furthermore, we are grateful to Manuela Veloso, Sriraam Natarajan, Jose Ambite, and
Pieter De Leenheer for giving keynote presentations on their recent work on Symbiotic Autonomy,
Human Allied Probabilistic Learning, Biomedical Data Science, and Data Intelligence.
</p>
        <p>Symbiotic Autonomy: Knowing When and What to Learn from Experience (Manuela M. Veloso)
Human Allied Probabilistic Learning (Sriraam Natarajan)
Data Intelligence in the 2020s (Pieter De Leenheer)
Semantics in Biomedical Data Science (Jose Luis Ambite)</p>
        <sec id="sec-4-20-1">
          <title>Research Papers</title>
          <p>Textual Evidence for the Perfunctoriness of Independent Medical Reviews (Adrian Brasoveanu, Megan Moodie and Rakshit Agrawal)
Knowledge Intensive Learning of Generative Adversarial Networks (Devendra Dhami, Mayukh Das and Sriraam Natarajan)
Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak (Amanuel Alambo, Manas Gaur and Krishnaprasad Thirunarayan)
Cost Aware Feature Elicitation (Srijita Das, Rishabh Iyer and Sriraam Natarajan)
A New Delay Differential Equation Model for COVID-19 (B Shayak, Mohit Manoj Sharma and Manas Gaur)
Public Health Implications of a delay differential equation model for COVID19 (Mohit Manoj Sharma and B Shayak)
Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets (Jitin Krishnan, Hemant Purohit, Huzefa Rangwala)</p>
          <p>Keynote Talk 1
Symbiotic Autonomy: Knowing When and What to Learn from Experience</p>
          <p>Manuela M. Veloso</p>
          <p>Head, JPMorgan AI Research
Herbert A. Simon University Professor, School of Computer Science</p>
          <p>Carnegie Mellon University
manuela.veloso@jpmchase.com</p>
          <p>The talk will present work on novel human-AI interaction, in which humans and AI complement
each other in their knowledge and learning. I will discuss examples in autonomous mobile
service robots and in the financial domain. I will conclude with a brief discussion of multiple
forms of available knowledge for AI systems that continuously learn from experience.</p>
          <p>Bio:
Manuela M. Veloso is the Head of J.P. Morgan AI Research, which pursues fundamental research
in areas of core relevance to financial services, including data mining and cryptography, machine
learning, explainability, and human-AI interaction. J.P. Morgan AI Research partners with
applied data analytics teams across the firm as well as with leading academic institutions
globally. Professor Veloso is on leave from Carnegie Mellon University as the Herbert A. Simon
University Professor in the School of Computer Science, and the past Head of the Machine
Learning Department. With her students, she has led research in AI, with a focus on robotics and
machine learning, having concretely researched and developed a variety of autonomous robots,
including teams of soccer robots and mobile service robots. Her robot soccer teams have been
RoboCup world champions several times, and the CoBot mobile robots have autonomously
navigated for more than 1,000 km in university buildings. Professor Veloso is the Past President
of AAAI (the Association for the Advancement of Artificial Intelligence), and the co-founder,
Trustee, and Past President of RoboCup. Professor Veloso has been recognized with multiple
honors, including being a Fellow of the ACM, IEEE, AAAS, and AAAI. She is the recipient of
several best paper awards, the Einstein Chair of the Chinese Academy of Sciences, the
ACM/SIGART Autonomous Agents Research Award, an NSF Career Award, and the Allen Newell
Medal for Excellence in Research. Professor Veloso earned Bachelor and Master of Science
degrees in Electrical and Computer Engineering from Instituto Superior Tecnico in Lisbon,
Portugal, a Master of Arts in Computer Science from Boston University, and a Master of Science
and Ph.D. in Computer Science from Carnegie Mellon University. See
www.cs.cmu.edu/~mmv/Veloso.html for her scientific publications.</p>
          <p>Keynote Talk 2
Human Allied Probabilistic Learning</p>
          <p>Sriraam Natarajan</p>
          <p>Director, Center for Machine Learning
Erik Jonsson School of Engineering and Computer Science</p>
          <p>The University of Texas at Dallas
sriraam.natarajan@utdallas.edu</p>
          <p>Historically, Artificial Intelligence has taken a symbolic route for representing and reasoning
about objects at a higher level or a statistical route for learning complex models from large data.
To achieve true AI, it is necessary to make these different paths meet and enable seamless
human interaction. First, I will briefly introduce learning from rich, structured, complex, and
noisy data. Next, I will present the recent progress that allows for more reasonable human
interaction where the human input is taken as “advice” and the learning algorithm combines this
advice with data. The advice can be in the form of qualitative influences, preferences over
labels/actions, privileged information obtained during training, or a simple precision-recall
trade-off. Finally, I will outline our recent work on “closing-the-loop” where information is
solicited from humans as needed, which allows for seamless interaction with the human expert.
While I will discuss these methods primarily in the context of probabilistic and relational
learning, I will also present our results on reinforcement learning and inverse reinforcement
learning.</p>
          <p>Bio:
Dr. Sriraam Natarajan is an Associate Professor and the Director of the Center for ML at the
Department of Computer Science at the University of Texas at Dallas. He was previously an
Associate Professor and, earlier, an Assistant Professor at Indiana University and Wake Forest School
of Medicine, and a post-doctoral research associate at the University of Wisconsin-Madison. He
received his Ph.D. from Oregon State University. His research interests lie in the field of
Artificial Intelligence, with emphasis on Machine Learning, Statistical Relational Learning and AI,
Reinforcement Learning, Graphical Models, and Biomedical Applications. He has received the
Young Investigator award from the US Army Research Office, Amazon Faculty Research Award, Intel
Faculty Award, XEROX Faculty Award, Verisk Faculty Award, and the IU Trustees Teaching
Award from Indiana University. He is the program co-chair of the SDM 2020 and ACM CoDS-COMAD
2020 conferences. He is the specialty chief editor of the Frontiers in ML and AI journal, an editorial
board member of the MLJ, JAIR, and DAMI journals, and the electronics publishing editor of JAIR.</p>
          <p>Keynote Talk 3
Data Intelligence in the Age of Accountability</p>
          <p>Pieter De Leenheer
Senior Research Fellow, Harvard Business School
Co-Founder and Chief Science Officer, Collibra Inc.</p>
          <p>pdeleenheer@hbs.edu</p>
        </sec>
        <sec id="sec-4-20-2">
          <title>Abstract:</title>
          <p>Knowledge graphs, machine learning and distributed ledgers are just a few of the emerging
intelligent technologies that unlock new options to innovate business models, augment scientific
knowledge and self-understanding, and enhance decision making. Data being a critical driver for
intelligent systems implies machine calculation may supplant human decision making in many
scenarios. The accessibility, quality and currency of data are necessary criteria to ensure these
systems produce viable innovation options that can be accounted for. But are these criteria
sufficient?</p>
          <p>Bio:
Pieter is a senior research fellow at Harvard Business School and serves as adjunct faculty at
Columbia University. He is a cofounder and former Chief Science Officer of Collibra, a unicorn
venture in data intelligence that spun off from his PhD research on community-based ontology
management. Pieter writes, teaches and advises on computing and management aspects of data
innovation, accountability and citizenship. He serves as an expert to the European Commission
and several governments, and as a board member of several startups such as Gluetech.com and
Yesse.tech. Prior to cofounding the company, Pieter was a professor at VU University of
Amsterdam. He lives in New York City with his family.</p>
          <p>Keynote Talk 4
Semantics in Biomedical Data Science</p>
          <p>Jose Luis Ambite</p>
          <p>Research Team Leader, Information Sciences Institute
Associate Research Professor, University of Southern California</p>
          <p>ambite@isi.edu</p>
        </sec>
        <sec id="sec-4-20-3">
          <title>Abstract:</title>
          <p>There is an explosion of biomedical data that promises to enable novel discoveries, treatments,
and the ultimate goal of personalized medicine. These data are generated in a great variety of
forms, ranging from sensor data, to imaging, to genetics, and all types of clinical data. Moreover,
the data are often scattered across organizations, and even for the same data type are
represented in diverse structures. Thus, it is critical to provide a semantically consistent view, so
that the data can be meaningfully analyzed. I will describe core data integration and
knowledge graph construction techniques, namely entity linkage and formal schema mappings,
with illustrative biomedical data integration applications, highlighting some novel neural
semantic similarity methods and some surprising applications of record linkage techniques, such
as efficiently finding genetically related individuals. I will discuss architectures for large scale
data integration and analysis, including sensor data. Finally, I will discuss how we can analyze
distributed datasets when the data cannot be shared for privacy or security reasons, and thus
cannot be integrated. I will describe our recent work on Heterogeneous Federated Learning that
learns common neural models from siloed data.</p>
          <p>Bio:
Dr. Jose Luis Ambite is an Associate Research Professor at the Computer Science Department,
and a Research Team Leader at the Information Sciences Institute, at the University of Southern
California. His core expertise is in information integration, including query rewriting under
constraints, learning schema mappings, and entity linkage. Dr. Ambite's research interests include
databases, knowledge representation, semantic web, semantic similarity, scientific workflows,
and biomedical data science. He has published widely on these topics. He regularly serves as a
reviewer for funding organizations, journals and major conferences. In recent years, he has
focused on developing novel approaches for integration, analysis, and dissemination of
biomedical and genetic data within several large NIH-funded projects, such as PRISMS-study,
NIMH Repository and Genetics Resource, SchizConnect, Population Architecture using Genomics
and Epidemiology, and Education Resource Discovery Index.</p>
          <p>Textual Evidence for the Perfunctoriness of Independent Medical Reviews</p>
          <p>We examine a database of 26,361 Independent Medical Reviews
(IMRs) for privately insured patients, handled by the California
Department of Managed Health Care (DMHC) through a private
contractor. IMR processes are meant to provide protection for
patients whose doctors prescribe treatments that are denied by their
health insurance (either private insurance or the insurance that is
part of their workers’ compensation; we focus on private insurance here).</p>
          <p>
            Laws requiring IMR were established in California and other states
because patients and their doctors were concerned that health
insurance plans deny coverage for medically necessary services. We
analyze the text of the reviews and compare them closely with a
sample of 50000 Yelp reviews [
            <xref ref-type="bibr" rid="ref12">19</xref>
            ] and the corpus of 50000 IMDB
movie reviews [
            <xref ref-type="bibr" rid="ref3">10</xref>
            ]. Despite the fact that the IMDB corpus is twice
as large as the IMR corpus, and the Yelp sample contains almost
twice as many reviews, we can construct a very good language
model for the IMR corpus using inductive sequential transfer
learning, specifically ULMFiT [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ], as measured by the quality of text
generation, as well as low perplexity (11.86) and high categorical
accuracy (0.53) on unseen test data, compared to the larger Yelp
and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy:
0.29 and 0.39). We see similar trends in topic models [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ] and
classification models predicting binary IMR outcomes and binarized
sentiment for Yelp and IMDB reviews. We also examine four other
corpora (drug reviews [6], data science job postings [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ], legal case
summaries [5] and cooking recipes [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ]) to show that the IMR
results are not typical for specialized-register corpora. These results
indicate that movie and restaurant reviews exhibit a much larger
variety, more contentful discussion, and greater attention to detail
compared to IMR reviews, which points to the possibility that a
crucial consumer protection mandated by law fails a sizeable class
of highly vulnerable patients.
          </p>
          <p>CCS CONCEPTS</p>
          <p>• Computing methodologies → Latent Dirichlet allocation;
Neural networks.</p>
          <p>KEYWORDS</p>
          <p>AI for social good, state-managed medical review processes,
language models, topic models, sentiment classification</p>
          <p>ACM Reference Format:
Adrian Brasoveanu, Megan Moodie, and Rakshit Agrawal. 2020. Textual
Evidence for the Perfunctoriness of Independent Medical Reviews. In
Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML’20),
9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn</p>
          <p>Independent Medical Review (IMR) processes are meant to provide
protection for patients whose doctors prescribe treatments that are
denied by their health insurance – either private insurance or the
insurance that is part of their workers’ compensation. In this paper,
we focus exclusively on privately insured patients. Laws requiring
IMR processes were established in California and other states in
the late 1990s because patients and their doctors were concerned
that health insurance plans deny coverage for medically necessary
services to maximize profit.<sup>1</sup></p>
          <p>As aptly summarized in [1], IMR is regularly used to settle
disputes between patients and their health insurers over what is
medically necessary or experimental/investigational care. Medical
necessity disputes occur between health plans and patients because
the health plan disagrees with the patient’s doctor about the
appropriate standard of care or course of treatment for a specific
condition. Under the current system of managed care in the U.S.,
services rendered by a health care provider are reviewed to
determine whether the services are medically necessary, a process
referred to as utilization review (UR). UR is the oversight
mechanism through which private insurers control costs by ensuring
that only medically necessary care, covered under the contractual
terms of a patient’s insurance plan, is provided. Services that are
not deemed medically necessary or fall outside a particular plan
are not covered.</p>
          <p>Procedures or treatment protocols are deemed experimental or
investigational because the health plan – but not necessarily the
patient’s doctor, who in many cases has enough clinical confidence
in a treatment to order it – considers them non-routine medical
care, or takes them to be scientifically unproven to treat the specific
condition, illness, or diagnosis for which their use is proposed.</p>
          <p>
            It is important to realize that the IMR process is usually the
third and final stage in the medical review process. The typical
progression is as follows. After in-person and possibly repeated
examination of the patient, the doctor recommends a treatment,
which is then submitted for approval to the patient’s health plan.
          </p>
          <p>
            <sup>1</sup>For California, see the Friedman-Knowles Act of 1996, requiring California health
plans to provide external independent medical review (IMR) for coverage denials. As
of late 2002, 41 states and the District of Columbia had passed legislation creating an
IMR process. In 34 of these states, including California, the decision resulting from the
IMR is binding to the health plan. See [
            <xref ref-type="bibr" rid="ref8">1, 15</xref>
            ] for summaries of the political and legal
history of the IMR system, and [2] for an early partial survey of the DMHC IMR data.
          </p>
          <p>If the treatment is denied in this first stage, both the doctor and
the patient may file an appeal with the health plan, which triggers
a second stage of reviews by the health-insurance provider, for
which a patient can supply additional information and a doctor
may engage in what is known as a “peer to peer” discussion with a
health-insurance representative. If these second reviews uphold the
initial denial, the only recourse the patient has is the state-regulated
IMR process, and per California law, an IMR grievance form (and
some additional information) is included with the denial letter.</p>
          <p>An IMR review must be initiated by the patient and submitted to
the California Department of Managed Health Care (DMHC), which
manages IMRs for privately-insured patients. Motivated treating
physicians may provide statements of support for inclusion in the
documentation provided to DMHC by the patient, but in theory
the IMR creates a new relationship of care between the
reviewing physician(s) hired by a private contractor on behalf of DMHC,
and the patient in question. The reviewing physicians’ decision is
supposed to be made based on what is in the best interest of the
patient, not on cost concerns. It is this relation of care that constitutes
the consumer protection for which IMR processes were legislated.</p>
          <p>Understandably, given that the patients in question may be ill or
disabled or simply discouraged by several layers of cumbersome
bureaucratic processes, there is very high attrition from the initial
review to the final, IMR, stage. That is, only the few highly
motivated and knowledgeable patients – or the extremely desperate –
get as far as the IMR process.</p>
          <p>The IMR process is regulated by the state, but it is actually
conducted by a third party. At this time (2019), the provider in
California and several other states across the US is MAXIMUS Federal
Services, Inc.<sup>2</sup> The costs associated with the IMR review, at least
in California, are covered by health insurers. It is DMHC’s and
MAXIMUS’s responsibility to collect all the documentation from
the patient, the patient’s doctor(s) and the health insurer. There
are no independent checks that all the documentation has actually
been collected, however, and patients do not see a final list of what
has been provided to the reviewer prior to the IMR decision itself
(a post facto list of file contents is mailed to patients along with the
final, binding, decision; it is unclear what recourse a patient may
have if they find pertinent information was missing from the review
file). Once the documentation is assembled, MAXIMUS forwards it
to anywhere from one to three reviewers, who remain anonymous,
but are certified by MAXIMUS to be appropriately credentialed
and knowledgeable about the treatment(s) and condition(s) under
review. The reviewer submits a summary of the case, and also a
rationale and evidence in support of their decision, which is a binary
Upheld/Overturned decision about the medical service. IMR
reviewers do not enter a consultative relationship with the patient, doctor
or health plan – they must render an uphold/overturn decision
based solely on the provided medical records. However, as noted
above, they are in an implied relationship of care to the patient, a
point to which we return in the Discussion section below (§4).</p>
          <p>While insurance carriers do not provide statistics about the
percentage of requested treatments that are denied in the initial stage,
looking at the process as a whole, a pattern of service denial aimed
at maximizing profit, rather than simply maintaining cost effectiveness,
seems to emerge. Typically, the argument for denial contends that
the evidence for the beneficial effects of the treatment fails the
prevailing standard of scientific evidence. This prevailing standard
invoked by IMR reviewers is usually randomized controlled trials
(RCTs), which are expensive, time-consuming trials that are run by
large pharmaceutical companies only if the treatment is ultimately
estimated to be profitable.</p>
          <p><sup>2</sup>https://www.maximus.com/capability/appeals-imr</p>
          <p>RCTs, however, have known limits: they “require minimal
assumptions and can operate with little prior knowledge [which] is
an advantage when persuading distrustful audiences, but it is a
disadvantage for cumulative scientific progress, where prior
knowledge should be built upon, not discarded.” [3] Inflexibly applying
the RCT “gold standard” in the IMR process is often a way to
ignore the doctors’ knowledge and experience in a way that seems
superficially well-reasoned and scientific. “RCTs can play a role in
building scientific knowledge and useful predictions” – and we add,
treatment recommendations – “only [. . . ] as part of a cumulative
program, [in combination] with other methods.” [3]</p>
          <p>Notably, the experimental/investigational category of treatments
that get denied often includes promising treatments that have not
been fully tested in clinical RCTs – because the treatment is new or
the condition is rare in the population, so treatment development
costs might not ultimately be recovered. Another common category
of experimental/investigational denials involves “off-label” drug
uses, that is, uses of FDA-approved pharmaceuticals for a purpose
other than the narrow one for which the drug was approved.</p>
          <p>1.2 Main argument and predictions</p>
          <p>Recall that these ‘experimental’ treatments or off-label uses are
recommended by the patient’s doctor, and therefore their potential
benefits are taken to outweigh their possible negative effects. The
recommending doctor is likely very familiar with the often lengthy,
tortuous and highly specific medical history of the patient, and with
the list of ‘less experimental’ treatments that have been proven
unsuccessful or have been removed from consideration for
patient-specific reasons. It is also important to remember that many rare
conditions have no “on-label” treatment options available, since
expensive RCTs and treatment approval processes are not undertaken
if companies do not expect to recover their costs, which is likely if
the potential ‘market’ is small (few people have the rare condition).</p>
          <p>Therefore, our main line of argumentation is as follows.</p>
          <p>• Since IMRs are the final stage in a long bureaucratic process
in which health insurance companies keep denying coverage
for a treatment repeatedly recommended by a doctor as
medically necessary, we expect that the issue of medical
necessity is non-trivial when that specific patient and that
specific treatment are carefully considered.
• We should therefore expect the text of the IMRs, which
justifies the final determination, to be highly individualized and
argue for that final decision (whether congruent with the
health plan’s decision or not) in a way that involves the
particulars of the treatment and the particulars of the patient’s
medical history and conditions.</p>
          <p>Thus, we expect a reasoned, thoughtful IMR to not be highly
generic and templatic / predictable in nature. For instance, legal
documents may be highly templatic as they discuss the application
of the same law or policy across many different cases, but a response
carefully considering the specifics of a medical case reaching the
IMR stage is not likely to be similar to many other cases. We only
expect high similarity and ‘templaticity’ for IMR reviews if they are
reduced to a more or less automatic application of some prespecified
set of rules (rubber-stamping).
</p>
          <p>1.3 Main results, and their limits</p>
          <p>Concomitantly with this quantitative study, we conducted
preliminary qualitative research with a focus on pain management and
chronic conditions. We investigated the history of the IMR process,
in addition to having direct experience with it. We had detailed
conversations with doctors in Northern California and on private
social media groups formed around chronic conditions and pain
management. This preliminary research consistently points to the
possibility that IMR reviews are perfunctory, and that this crucial
consumer protection mandated by law seems to fail for a sizeable
class of highly vulnerable patients. In this paper, we focus on the
text of the IMR decisions and attempt to quantify the evidence for
the perfunctoriness of the IMR process that they provide.</p>
          <p>The text of the IMR findings does not provide unambiguous
evidence about the quality and appropriateness of the IMR process.</p>
          <p>If we had access to the full, anonymized patient files submitted to
the IMR reviewers (in addition to the final IMR decision and the
associated text), we might have been able to provide much stronger
evidence that IMRs should have a significantly higher percentage of
overturns, and that the IMR process should be improved in various
ways, e.g., (i) patients should be able to check that all the relevant
documentation has been collected and will be reviewed, and (ii)
the anonymous reviewers should be held to higher standards of
doctor-patient care. At the very least, one would want to compare
the reports/letters produced by the patient’s doctor(s) and the IMR
texts. However, such information is not available and there are no
visible signs suggesting potential availability in the near future.</p>
          <p>The information that is made available by DMHC consists of the
IMR decision (whether to uphold or overturn the health plan
decision), the anonymized decision letter, and information about
the requested treatment category (also available in the letter). We,
therefore, had to limit ourselves to the text of the DMHC-provided
IMR findings in our empirical analysis.</p>
          <p>A qualitative inspection of the corpus of IMR decisions made
available by the California DMHC site as of June 2019 (a total of
26,361 cases spanning the years 2001-2019) indicates that the
reviews – as documented in the text of the findings – focus more
on the review procedure and associated legalese than on the
actual medical history of the patient and the details of the case. For
example, decisions for chronic pain management seem to mostly
rubber-stamp the Medical Treatment Utilization Schedule (MTUS)
guidelines, with very little consideration of the rarity of the
underlying condition(s) (see our comments about RCTs above), or
a thoughtful evaluation of the risk/benefit profile of the denied
treatment relative to the specific medical history of the patient
(assuming this history was adequately documented to begin with).</p>
          <p>The goal in this paper is to investigate to what extent
Natural Language Processing (NLP) / Machine Learning (ML)
methods that are able to extract insights from large corpora point in
the same direction, thus mitigating cherry-picking biases that are
sometimes associated with qualitative investigations. In addition
to the IMR text, we perform a comparative study with additional
English-language datasets in an attempt to eliminate data-specific
and problem-specific biases.</p>
          <p>
            • We analyze the text of the IMR reviews and compare them
with a sample of 50,000 Yelp reviews [
            <xref ref-type="bibr" rid="ref12">19</xref>
            ] and the corpus of
50,000 IMDB movie reviews [
            <xref ref-type="bibr" rid="ref3">10</xref>
            ].
• As the size of data has significant consequences for language-model
training, and NLP/ML models more generally, one
would expect models trained on the Yelp and IMDB corpora to
outperform models trained on the IMR corpus, given that
the IMDB corpus is twice as large as the IMR corpus, and
the Yelp sample contains almost twice as many reviews.
• In this paper, we instead demonstrate that we were able
to construct a very good language model for the IMR
corpus using inductive sequential transfer learning, specifically
            ULMFiT [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ], as measured by the quality of text generation.
• In addition, the model achieves a much lower perplexity
(11.86) and a higher categorical accuracy (0.53) on unseen
test data, compared to models trained on the larger Yelp
and IMDB corpora (perplexity: 40.3 and 37, respectively;
categorical accuracy: 0.29 and 0.39).
• We see similar trends in topic models [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ] and
classification models predicting binary IMR outcomes and binarized
sentiment for Yelp and IMDB reviews.
          </p>
          <p>
            These results indicate that movie and restaurant reviews
exhibit a much larger variety, more contentful discussion, and greater
attention to detail compared to IMR reviews. In an attempt to
mitigate confirmation bias, as well as potentially significant register
differences between IMRs and movie or restaurant reviews, we
examine four additional corpora: drug reviews [6], data science
job postings [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ], legal case summaries [5] and cooking recipes [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ].
          </p>
          <p>These specialized-register corpora are potentially more similar to
IMRs than IMDB or Yelp: the texts are more likely to be highly
similar, include boilerplate text and have a templatic/standardized
structure. We find that the predictability of IMR texts, as measured by
language-model perplexity and categorical accuracy, is higher than
that of all the comparison datasets by a good margin.</p>
          <p>Based on these empirical comparisons, we conclude that we
have strong evidence that the IMR reviews are perfunctory and,
therefore, that a crucial consumer protection mandated by law
seems to fail for a sizeable class of highly vulnerable patients. The
paper is structured as follows. In Section 2, we discuss the datasets
in detail, with a focus on the nature and characteristics of the IMR
data. In Section 3, we discuss the models we use to analyze the IMR,
Yelp and IMDB datasets, as well as the four auxiliary corpora (drug
reviews, data science jobs, legal cases and recipes). The section also
compares and discusses the results of these models. Section 4 puts
all the results together into an argument for the perfunctoriness of
the IMRs. Section 5 concludes the paper and outlines directions for
future work.
</p>
          <p>2 THE DATASETS</p>
          <p>2.1 The IMR dataset</p>
          <p>The IMR dataset was obtained from the DMHC website in June
2019<sup>3</sup> and was minimally preprocessed. It contains 26,361 cases /
observations and 14 variables, 4 of which are the most relevant:
• TreatmentCategory: the main treatment category;
• ReportYear: the year the case was reported;
• Determination: indicates whether the health plan decision was upheld or
overturned;
• Findings: a summary of the case findings.</p>
          <p>The top 14 treatment categories (those accounting for ≥ 2% of the total),
together with their raw counts and percentages, are provided in
Table 1.</p>
          <p><sup>3</sup> https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinationstrend.</p>
          <p>
As comparison datasets, we use the IMDB movie-review dataset [
            <xref ref-type="bibr" rid="ref3">10</xref>
            ],
which has 50,000 reviews and a binary positive/negative sentiment
classification associated with each review. This dataset will be
particularly useful as a baseline for our ULMFiT transfer-learning
language models (and subsequent transfer-learning classification
models), where we show that we obtain results for the IMDB dataset
that are similar to the ones in the original ULMFiT paper [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ].
          </p>
          <p>There are 50,000 movie reviews in the IMDB dataset, evenly split
into negative and positive reviews. The histogram of text lengths
for IMDB reviews is provided in Figure 2. The reviews contain a
total of 11,557,297 words. The mean length of a review is 231.15
words, with an SD of 171.32.</p>
          <p>
            We select a sample of 50,000 Yelp (mainly restaurant) reviews [
            <xref ref-type="bibr" rid="ref12">19</xref>
            ],
with associated binarized negative/positive evaluations, to provide
a comparison corpus intermediate between our DMHC dataset and
the IMDB dataset. From a total of 560,000 reviews (evenly split
between negative and positive), we draw a weighted random sample
with the weights provided by the histogram of text lengths for the
IMR corpus. The resulting sample contains 25,809 (52%) negative
reviews and 24,191 (48%) positive reviews. The histogram of text
lengths for Yelp reviews is also provided in Figure 2. The reviews
contain a total of 7,038,467 words. The mean length of a review is
140.77 words, with an SD of 71.09.
</p>
          <p>[Figure 2: histogram of IMR text lengths (x-axis: IMR text length, # of words; y-axis: normalized # of texts of a given length).]</p>
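          <p>The length-matched sampling of Yelp reviews described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure actually used for the paper: the function names, the 50-word bin width, and the choice of the Efraimidis-Spirakis key method for weighted sampling without replacement are all hypothetical choices made for illustration.</p>

```python
import random
from collections import Counter


def length_histogram(lengths, bin_width=50):
    """Relative frequency of each length bin in the reference corpus."""
    counts = Counter(n // bin_width for n in lengths)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def weighted_sample(candidates, reference_lengths, k, bin_width=50, seed=0):
    """Draw k candidate texts without replacement, weighting each text by
    how common its length bin is in the reference corpus.

    Uses the Efraimidis-Spirakis scheme: each item gets the key
    u ** (1 / w), with u uniform in [0, 1), and the k largest keys win.
    """
    rng = random.Random(seed)
    hist = length_histogram(reference_lengths, bin_width)
    keyed = []
    for text in candidates:
        # Assumption: text length is the number of whitespace-separated tokens.
        w = hist.get(len(text.split()) // bin_width, 0.0)
        if w > 0:  # lengths never seen in the reference corpus are skipped
            keyed.append((rng.random() ** (1.0 / w), text))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in keyed[:k]]
```

          <p>In this sketch, <monospace>reference_lengths</monospace> would hold the word counts of the IMR findings and <monospace>candidates</monospace> the pool of 560,000 Yelp reviews, with <monospace>k</monospace> set to 50,000.</p>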
          <p>[Figure 2, continued: histograms of IMDB-review and Yelp-review text lengths (x-axis: text length, # of words).]</p>
          <p>2.3 Four auxiliary datasets</p>
          <p>
We will also analyze four other specialized-register corpora: drug
reviews [6], data science (DS) job postings [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ], legal case reports [5]
and cooking recipes [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ]. The modeling results for these specialized-register
corpora will enable us to better contextualize and evaluate
the modeling results for the IMR, IMDB and Yelp corpora, since
these four auxiliary datasets might be seen as more similar to the
IMR corpus than movie or restaurant reviews. The drug-review
corpus contains reviews of pharmaceutical products, which are
closer in subject matter to IMRs than movie/restaurant reviews.
          </p>
          <p>The other three corpora are all highly specialized in register, just
like the IMRs, with two of them (DS jobs and legal cases) particularly
similar to the IMRs in that they involve templatic texts containing
information aimed at a specific professional sub-community.</p>
          <p>These four corpora are very different from each other and from
the IMR corpus in terms of (i) the number of texts that they contain
and (ii) the average text length (number of words per text). Because
of this, there was no obvious way to sample from them and from
the IMR, IMDB and Yelp corpora in such a way that the resulting
samples were both roughly comparable with respect to the total
number of texts and average text length, and also large enough to
obtain reliable model estimates. We therefore analyzed these four
corpora as a whole.</p>
          <p>The drug-review corpus includes 132,300 drug reviews, more
than double the number of texts in the IMDB and Yelp datasets,
and more than 4 times the number of texts in the IMR dataset. From
the original corpus of 215,063 reviews, we only retained the reviews
associated with a rating of 10, which we label as positive reviews,
and a rating of 1 through 5, which we label as negative reviews.<sup>4</sup></p>
          <p><sup>4</sup> We did this so that we have a fairly balanced dataset (68,005 positive drug reviews and
64,295 negative reviews) to estimate classification models like the ones we report for
the IMR, IMDB and Yelp corpora in the next section. For completeness, the drug-review
classification results on previously unseen test data are as follows: logistic regression
accuracy: 77.89%; accuracy of multilayer perceptron with a 1,000-unit hidden layer
and a ReLU non-linearity: 83.18%; ULMFiT classification model accuracy: 96.12%.</p>
          <p>The histogram of text lengths for drug reviews is provided in
Figure 3. The reviews contain a total of 11,015,248 words, with a
mean length of 83.26 words per review (significantly shorter than
the IMR/IMDB/Yelp texts) and an SD of 45.73.</p>
          <p>The DS corpus includes 6,953 job postings (about a quarter of
the texts in the IMR corpus), with a total of 3,731,051 words. The
histogram of text lengths is provided in Figure 3. The mean length
of a job posting is 536.61 words (more than twice as long as the
IMR/IMDB/Yelp texts), with an SD of 254.06.</p>
          <p>There are 3,890 legal-case reports (even fewer than DS job
postings), with a total of 25,954,650 words (about 5 times larger than
the IMR corpus). The histogram of text lengths for the legal-case
reports is provided in Figure 3. The mean length of a report is 6,672.15
words (an order of magnitude longer than IMR/IMDB/Yelp), with a
very high SD of 11,997.98.</p>
          <p>Finally, the recipe corpus includes more than 1 million texts:
there are 1,029,719 recipes, with a total of 117,563,275 words (very
large compared to our other corpora). The histogram of text lengths
for the recipes is provided in Figure 3. The mean length of a recipe
is 114.17 words (close to the length of a drug review, and roughly
half of an IMR), with an SD of 90.54.</p>
          <p>3 THE MODELS</p>
          <p>In this section, we analyze the text of the IMR findings and its
predictiveness with respect to IMR outcomes. We systematically
compare these results with the corresponding ones for the IMDB
and Yelp corpora. The datasets were split into training (80%),
validation (10%) and test (10%) sets. Test sets were only used for the
final model evaluation.</p>
          <p>We start with baseline classification models (logistic regressions
and logistic multilayer perceptrons with one hidden layer) to
establish that the reviews in all three datasets under consideration
are highly predictive of the associated binary outcomes. Once the
predictiveness, hence, relevance, of the text is established, we turn
to an in-depth analysis of the texts themselves by means of topic
and language models. We see that the text of the IMR reviews is
significantly different (more predictable, less diverse / contentful)
when compared to movie and restaurant reviews. We then turn to
a final set of classification models that leverage transfer learning
from the language models to see how predictive the texts can
really be with respect to the associated binary outcomes. Finally, we
report the results of estimating language models for the 4 auxiliary
datasets introduced in the previous section.</p>
          <p>The main conclusion of this extensive series of models is that
the IMR corpus is an outlier, and it would be easy to make the
IMR process fully automatic: it is straightforward to train
models that generate high-quality, realistic IMR reviews and
generate binary decisions that are very reliably associated with these
reviews. In contrast, movie and restaurant reviews produced by
unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more
human-like depth, sophistication and attention to detail, so current
NLP models do not perform as well on them.
</p>
          <p>3.1 Classification models</p>
          <p>We regress outcomes (Upheld/Overturned for IMR or negative/positive
sentiment for IMDB/Yelp) against the text of the corresponding
findings / reviews. For the purposes of these basic classification
models, as well as the topic models discussed in the following
subsection, the texts were preprocessed as follows. First, we removed
stop words; for the IMR dataset, we also removed the following
high-frequency words: patient, treatment, reviewer, request,
medical and medically, and for the IMDB dataset, we also removed the
words film and movie. After part-of-speech tagging, we retained
only nouns, adjectives, verbs and adverbs, since lexical meanings
provide the most useful information for logistic (more generally,
feed-forward) models and topic models. The resulting dictionary
for the IMR dataset had 23,188 unique words. We ensured that
the dictionaries for the IMDB and Yelp datasets were also between
23,000 and 24,000 words by eliminating infrequent words. Bounding
the dictionaries for each dataset to a similar range helps mitigate
dataset-specific modeling biases: having differently-sized vocabularies
leads to differently-sized parameter spaces for the models.</p>
          <p>We extracted features by converting each text into a sparse bag-of-words
vector of dictionary length, which recorded how many times
each token occurred in the text. These feature representations were
the input to all the classifier models we consider in this subsection.</p>
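          <p>The feature extraction step can be illustrated with a short sketch. It assumes the texts have already been tokenized and filtered as described above; the function names and the dictionary-of-counts representation of a sparse vector are hypothetical choices for illustration, not the implementation used for the reported models.</p>

```python
from collections import Counter


def build_dictionary(tokenized_texts, max_size):
    """Keep the max_size most frequent tokens and map each to a column index."""
    freq = Counter(tok for text in tokenized_texts for tok in text)
    return {tok: i for i, (tok, _) in enumerate(freq.most_common(max_size))}


def bag_of_words(tokens, dictionary):
    """Sparse count vector as {column index: count}; unknown tokens are dropped."""
    return dict(Counter(dictionary[t] for t in tokens if t in dictionary))


# Toy example with made-up tokens (not taken from the IMR corpus).
texts = [["denial", "upheld", "guideline"],
         ["guideline", "guideline", "overturned"]]
vocab = build_dictionary(texts, max_size=3)
vector = bag_of_words(texts[1], vocab)
```

          <p>Capping the dictionary with <monospace>max_size</monospace> mirrors the idea of eliminating infrequent words so that all datasets end up with similarly sized vocabularies.</p>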
          <p>The multilayer perceptron model had a single hidden layer with
1,000 units and a ReLU non-linearity. The classification accuracies
on the test data for all three datasets are provided in Table 3.</p>
          <p>We see that the text of the findings / reviews is highly predictive
of the associated binary outcomes, with the highest accuracy for the
IMR dataset despite the fact that it contains half the observations
of the other two data sets. We can therefore turn to a more in-depth
analysis of the texts to understand what kind of textual
justification is used to motivate the IMR binary decisions. To that
end, we examine and compare the results of two unsupervised/self-supervised
types of models: topic models and language models.</p>
          <p>3.2 Topic models</p>
          <p>
            Topic modeling [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ] is an unsupervised method that distills
semantic properties of words and documents in a corpus in terms of
probabilistic topics. The most widespread measure for topic model
evaluation is the coherence score [
            <xref ref-type="bibr" rid="ref7">14</xref>
            ]. Typically, as we increase
the number of topics from very few, say, 4 topics, to more of them,
we see an increase in coherence score that tends to level out after
a certain number of topics. When modeling the IMDB and Yelp
datasets, we see exactly this behavior, as shown in Figure 4.
          </p>
          <p>In contrast, the 4-topic model has the highest coherence score
(0.56) for the IMR data set, also shown in Figure 4. Furthermore,
as we add more topics, the coherence score drops. As the word
clouds for the 4-topic model in Figure 5 show, these 4 topics mostly
reflect the legalese associated with the IMR review procedure and
very little, if anything, of the treatments and conditions that were
the main point of the review. In contrast, the corresponding high-scoring
topic models for the IMDB and Yelp datasets reflect actual
features of movies, e.g., family-life movies, westerns, musicals etc.,
or breakfast/lunch places, restaurants, shops, bars, hotels etc.</p>
          <p>Recall that IMRs are the legally-mandated last resort for patients
seeking treatments (usually) ordered by their doctors, and which
their health plan refuses to cover. The reviews are conducted
exclusively based on documentation. Putting aside the fact that it is
unclear how much effort is made to ensure that the documentation
is complete, especially for patients with extensive and complicated
health records, we see that relatively little specific information
about a patient’s medical history, condition(s), or the recommended
treatments is reflected in the text of these decisions. The text seems
to consist largely of legalese about the IMR process, the health plan
/ providers, basic demographic information about the patient, and
generalities about the medical service or therapy requested for the
enrollee’s condition.
</p>
          <p>3.3 Language models with transfer learning</p>
          <p>Neural language models are usually
recurrent-network or transformer-based architectures designed
to learn textual distributional patterns in an unsupervised or
self-supervised manner. Recurrent-network models – on which we
focus here – commonly use Long Short-Term Memory (LSTM) [7]
“cells,” which are able to learn long-term dependencies in sequences.</p>
          <p>
            Representing text as a sequence of words, language models build
rich representations of the words, sentences, and their relations
within a certain language. We estimate a language model for the
IMR corpus using inductive sequential transfer learning, specifically
ULMFiT [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ]. Just as in [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ], we use the AWD-LSTM model [
            <xref ref-type="bibr" rid="ref5">12</xref>
            ], a vanilla
LSTM with 4 kinds of dropout regularization, embedding size of
400, 3 LSTM layers (1,150 units per layer), and a BPTT of size 70.
</p>
          <p>[Figure 4: coherence scores as a function of the number of topics for the IMR, IMDB and Yelp datasets.]</p>
          <p>
            The AWD-LSTM model is pretrained on Wikitext-103 [
            <xref ref-type="bibr" rid="ref6">13</xref>
            ],
consisting of 28,595 preprocessed Wikipedia articles, with a total of 103
million words. This pretrained model is fairly simple (no attention,
skip connections etc.), and the pretraining corpus is of modest size.
          </p>
          <p>
            To obtain our final language models for the IMR, IMDB and
Yelp corpora, we fine-tune the pretrained AWD-LSTM model using
discriminative [
            <xref ref-type="bibr" rid="ref11">18</xref>
            ] and slanted triangular [
            <xref ref-type="bibr" rid="ref1 ref9">8, 16</xref>
            ] learning rates. We
do the same kind of minimal text preprocessing as in [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ].
          </p>
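          <p>The two learning-rate techniques named above can be sketched as follows. This is a minimal sketch based on the formulas given in the ULMFiT paper [<xref ref-type="bibr" rid="ref1">8</xref>], not the code used for the experiments reported here. The function names are hypothetical, and the default values (peak rate 0.01, cut fraction 0.1, ratio 32, per-layer decay factor 2.6) are the ones suggested in [<xref ref-type="bibr" rid="ref1">8</xref>]; whether the same values were used for the IMR, IMDB and Yelp models is an assumption.</p>

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at training step t (0-based) under the slanted
    triangular schedule: a short linear warm-up, then a long linear decay."""
    # Guard against a zero-length warm-up on very short runs (sketch-only choice).
    cut = max(1, math.floor(total_steps * cut_frac))
    if t >= cut:
        # Long linear decay after the peak.
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    else:
        # Short linear warm-up towards the peak.
        p = t / cut
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_top, n_layers, decay=2.6):
    """Per-layer learning rates, ordered from the lowest layer to the top
    layer; each layer below the top trains at the rate above it divided
    by the decay factor."""
    return [lr_top / decay ** (n_layers - 1 - i) for i in range(n_layers)]
```

          <p>With these defaults, the rate starts at <monospace>lr_max / ratio</monospace>, peaks at <monospace>lr_max</monospace> after the first 10% of the steps, and returns to <monospace>lr_max / ratio</monospace> at the end, while lower layers, which hold more general features, are updated more cautiously than higher ones.</p>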
          <p>The perplexity and categorical accuracy for the 3 language
models are provided in Table 4. The perplexity for the IMR findings is
much lower than for the IMDB / Yelp reviews, and the language
model can correctly guess the next word more than half the time.</p>
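          <p>For reference, the two metrics can be computed as in the sketch below. It uses the standard definitions (perplexity as the exponential of the mean negative log-probability of the true next token, categorical accuracy as the share of positions where the top-ranked prediction is the true token); the function names are hypothetical, and the sketch assumes natural logarithms.</p>

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true next tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


def categorical_accuracy(predicted_tokens, true_tokens):
    """Fraction of positions where the top-ranked prediction is the true token."""
    hits = sum(p == t for p, t in zip(predicted_tokens, true_tokens))
    return hits / len(true_tokens)
```

          <p>Under these definitions, a model that spreads its probability evenly over four candidate words at every position has a perplexity of 4, and a perplexity of 11.86 corresponds to an average cross-entropy of roughly 2.47 nats per token.</p>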
          <p>The IMR language model can generate high-quality and largely
coherent text, unlike the IMDB / Yelp models. Two samples of
generated text are provided below (the ‘seed’ text is boldfaced).</p>
          <p>• The issue in this case is whether the requested partial
hospitalization program ( PHP ) services are medically necessary
for treatment of the patient ’s behavioral health condition
. The American Psychiatric Association ( APA ) treatment
guidelines for patients with eating disorders also consider
PHP acute care to be the most appropriate setting for
treatment , and suggest that patients should be treated in the least
restrictive setting which is likely to be safe and effective .
The PHP was initially recommended for patients who were
based on their own medical needs , but who were</p>
          <p>• The patient was admitted to a skilled nursing facility (
SNF ) on 12 / 10 / 04 . The submitted documentation states
the patient was discharged from the hospital on 12 / 22 /
04 . The following day the patient ’s vital signs were
stable . The patient had been ambulating to the community
with assistance with transfers , but has not had any recent
medical or rehabilitation therapy . The patient had no new
medical problems and was discharged in stable condition .
The patient has requested reimbursement for the inpatient
acute rehabilitation services provided</p>
          <p>We see that the IMR language model is highly performant,
despite the simple model architecture we used, the modest size of
the pretraining corpus, and the small size of the IMR corpus. The
quality of the generated text is also very high, particularly given
all these limitations.
</p>
          <p>3.4 Classification with transfer learning</p>
          <p>
We further fine-tune the language models discussed in the previous
subsection to train classifiers for the three datasets. Following [
            <xref ref-type="bibr" rid="ref1">4, 8</xref>
            ],
we gradually unfreeze the classifier models to avoid catastrophic
forgetting.
          </p>
          <p>The results of evaluating the classifiers on the withheld test
sets are provided in Table 5. Despite the fact that the IMR dataset
contains half of the classification observations of the other two
datasets, we obtain the highest level of accuracy when predicting
binary Upheld/Overturned decisions based on the text of the IMR
findings.
We also estimated topic and language models for the 4 auxiliary
corpora (drug reviews, DS jobs, legal cases and cooking recipes).</p>
          <p>The associations between coherence scores and number of topics
for these 4 corpora were similar to the ones plotted in Figure 4 above
for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best
topic models had at least 14 topics, often more, with coherence
scores above 0.5. The quality of the topics was also high, with
intuitively coherent and contentful topics (just like IMDB / Yelp).</p>
          <p>The perplexity and accuracy of the ULMFiT language models
on previously-withheld test data are provided in Table 6, which
contains the results for all the 7 datasets under consideration in
this paper. We see that the predictability of the IMR corpus, as
reflected in its perplexity and categorical accuracy scores, is still
clearly higher than the 4 auxiliary corpora. The perplexity of the
legal-case corpus (18.17) is somewhat close to the IMR perplexity
(11.86), but we should remember that the legal-case corpus is about
5 times larger than the IMR corpus. Furthermore, the legal-case
categorical accuracy of 43% is still substantially lower than the IMR
accuracy of 53%. Notably, even the recipe corpus, which is about 20
times larger than the IMR corpus (≈ 117.5 vs. ≈ 5.5 million words)
does not have test-set scores similar to the IMR scores.</p>
          <p>The results for these 4 auxiliary corpora indicate that the IMR
corpus is an outlier, with very highly templatic and generic texts.
</p>
          <p>4 DISCUSSION</p>
          <p>The models discussed in the previous section show that language-model
learning is significantly easier for IMRs compared to the other
6 corpora. As can be seen in Table 6, perplexity in the language
model for IMR reviews is clearly lower than even for legal cases, for
which we expect highly templatic language and high similarity
between texts. This pattern can be clearly observed in Figure 6,
with the IMR corpus clearly at the very end of the high-to-low
predictability spectrum.</p>
          <p>One would not expect such highly predictable texts in an ideal
scenario, where each medical review is thorough, and each
decision is accompanied by strong medical reasoning relying on the
specifics of the case at hand, and based on an objective physician’s,
or team of physicians’, opinion as to what is in the patient’s best
interest. Arguably, these medically complex cases are as diverse as
Hollywood blockbusters or fashionable restaurants – the patients
themselves certainly experience them as unique and meaningful
–, and their reviews should be similarly diverse, or at most as
templatic as a job posting or a cooking recipe. We wouldn’t expect
these medical reviews to be so much more predictable and generic
than less socially consequential reviews of movies and restaurants.</p>
          <p>[Figure 6: language-model perplexity by corpus (y-axis: perplexity; recoverable corpus labels: IMR, legal cases, DS jobs, Yelp).]</p>
          <p>What are the ethical and potentially legal consequences of these
findings? First, while state legislators assume we have strong
health-insurance-related consumer protections in place, an image DMHC
goes to great lengths to promote, we find the reviews to be
upholding insurance plan denials at rates that exceed what one might
expect, given that the treatments in question are frequently being
ordered by a treating physician, and that the IMR process is the last
stage in a bureaucratically laborious (hence high-attrition) process
of appealing health-plan denials.</p>
          <p>Second, given that the IMR process creates an implied relation
of care between the reviewers hired by MAXIMUS and the patient –
since reviewers are, after all, being entrusted with the best interests
of the patient without regard to cost –, one can hardly say that they
are fulfilling their obligations as doctors to their patient with such
seemingly rote, perfunctory reviews.</p>
          <p>Third, if IMR processes were designed to make sure that (i)
treatment decisions are being made by doctors, not by profit-driven
businesses, and (ii) insurance companies cannot renege on their
responsibilities to plan members, one must wonder whether
prescribing physicians are wrong more than half the time. Do American
doctors really order so many erroneous, medically unnecessary
treatments and medications? If so, how is it possible that they are
so committed and confident in them that they are willing to escalate
the appeal process all the way to the state-managed IMR stage?
Or is it that IMRs often serve as a final rubber stamp for health-insurance
plan denials, failing their stated mission of protecting a
vulnerable population?</p>
          <p>We end this discussion section by briefly reflecting on the way
we used ML/NLP methods for social-good problems in this paper.</p>
          <p>Overwhelmingly, the social-good applications of these methods
and models seem to be predictive in nature: their goal is to improve
the outcomes of a decision-making process, and the improvement
is evaluated according to various performance-related metrics. An
important class of metrics that are currently being developed have
to do with ethical, or ‘safe,’ uses of ML/AI models.</p>
          <p>In contrast, our use of ML models in this paper was analytical,
with the goal of extracting insights from large datasets that enable
us to empirically evaluate how well an established decision-making
process with high social impact functions. Data analysis of this
kind, more akin to hypothesis testing than to predictive modeling,
is in fact one of the original uses of statistical models and methods.
Unfortunately, using ML models in this way does not straightforwardly
lead to plots showing how ML models obviously improve
metrics like the efficiency or cost of a process. We think, however,
that there are as many socially beneficial opportunities for this kind
of data-analysis use of ML modeling as there are for its predictive
uses. The main difference between them seems to be that the
data-analysis uses do not lead to more-or-less immediately measurable
products. Instead, they are meant to become part of a larger
argument and evaluation of a socially and politically relevant issue,
e.g., the ethical status of current health-insurance related practices
and consumer protections discussed here. What counts as ‘success’
when ML models are deployed in this way is less immediate, but
could provide at least as much social good in the long run.</p>
          <p>5 CONCLUSION AND FUTURE WORK
We examined a database of 26,361 IMRs handled by the California
DMHC through a private contractor. IMR processes are meant to
provide protection for patients whose doctors prescribe treatments
that are denied by their health insurance.</p>
          <p>We found that, in a majority of cases, IMRs uphold the health
insurance denial, despite DMHC’s claim to the contrary. In addition,
we analyzed the text of the reviews and compared them with a
sample of 50,000 Yelp reviews and the IMDB movie review corpus.</p>
          <p>Despite the fact that these corpora are roughly twice as large, we
can construct a very good language model for the IMR corpus,
as measured by the quality of text generation, as well as its low
perplexity and high categorical accuracy on unseen test data. These
results indicate that movie and restaurant reviews exhibit a much
larger variety, more contentful discussion, and greater attention
to detail compared to IMR reviews, which seem highly templatic
and perfunctory in comparison. We see similar trends in topic
models and classification models predicting binary IMR outcomes
and binarized sentiment for Yelp and IMDB reviews.</p>
          <p>These results were further confirmed by topic and language
models for four other specialized-register corpora (drug reviews,
data science job postings, legal-case reports and cooking recipes).</p>
          <p>We are in the process of extending our datasets with (i) workers’
comp cases from California and (ii) private insurance cases from
other states. This will enable us to investigate whether the reviews for
workers’ comp cases are substantially different from the DMHC
IMR data (the percentage of upheld decisions is much higher for
workers’ comp: ≈ 90%), as well as whether the reviews vary substantially
across states.</p>
          <p>Another direction for future work is to follow up on our
preliminary qualitative research with a survey of patients who have
experienced the IMR process to see if these patients agree with the
DMHC-promoted message that the IMR process provides strong
consumer protection against unjustified health-plan denials. This
could also enable us to verify whether the medical documentation
collected during the IMR process is complete and actually taken into
account when the decision is made.</p>
          <p>The ultimate upshot of this project would be a list of
recommendations for the improvement of the IMR process, including but not
limited to (i) adding ways for patients to check that all the
relevant documentation has been collected and will be reviewed, and
(ii) identifying ways to hold the anonymous reviewers to higher
standards of doctor-patient care.
ACKNOWLEDGMENTS
We are grateful to four KDD-KiML anonymous reviewers for their
comments on an earlier version of this paper. We gratefully
acknowledge the support of the NVIDIA Corporation with the donation of
two Titan V GPUs used for this research, as well as the UCSC Office
of Research and The Humanities Institute for a matching grant to
purchase additional hardware. The usual disclaimers apply.</p>
          <p>Knowledge Intensive Learning of Generative Adversarial Networks
Devendra Singh Dhami
devendra.dhami@utdallas.edu
The University of Texas at Dallas</p>
          <p>Mayukh Das</p>
          <p>Samsung Research India
mayukh.das@samsung.com</p>
          <p>Sriraam Natarajan
The University of Texas at Dallas
sriraam.natarajan@utdallas.edu
ABSTRACT
While Generative Adversarial Networks (GANs) have accelerated
the use of generative modelling within the machine learning
community, most of the applications of GANs are restricted to images.</p>
          <p>The use of GANs to generate clinical data has been rare due to the
inability of GANs to faithfully capture the intrinsic relationships
between features. We hypothesize and verify that this challenge can
be mitigated by incorporating domain knowledge in the generative
process. Specifically, we propose human-allied GANs that use
correlation advice from humans to create synthetic clinical data. Our
empirical evaluation demonstrates the superiority of our approach
over other GAN models.</p>
          <p>CCS CONCEPTS</p>
          <p>• Deep Learning → Generative Adversarial Networks; •
Application → Healthcare; • Learning → Knowledge Intensive
Learning.</p>
          <p>KEYWORDS</p>
          <p>generative adversarial networks, human in the loop, healthcare
ACM Reference Format:
Devendra Singh Dhami, Mayukh Das, and Sriraam Natarajan. 2020.
Knowledge Intensive Learning of Generative Adversarial Networks. In Proceedings
of KDD Workshop on Knowledge-infused Mining and Learning (KiML’20). ,
6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn</p>
          <p>
            1 INTRODUCTION
Deep learning models have reshaped the machine learning landscape
over the past decade [
            <xref ref-type="bibr" rid="ref22 ref9">16, 29</xref>
            ]. Specifically, Generative
Adversarial Networks (GANs) [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ] have found tremendous success in
generating examples for images [34, 37, 45], photographs of human
faces [
            <xref ref-type="bibr" rid="ref18">1, 25, 52</xref>
            ], image to image translation [
            <xref ref-type="bibr" rid="ref23">30, 33, 55</xref>
            ] and 3D
object generation [44, 51, 53], to name a few. Despite such success,
several key factors limit the widespread adoption of GANs for a
broader range of tasks: the widely acknowledged data-hungry nature
of such methods, potential access issues with real medical data,
and their restricted usage, mainly in the context of images. These
factors have limited the use of these arguably successful techniques
in medical (or similar) domains. Recently, however, synthetic data
generation has become a centerpiece of research in medical AI due to
the diverse difficulties in the collection, persistence, sharing and
analysis of real clinical data.
          </p>
          <p>
            We aim to address the above limitations. Inspired by Mitchell’s
argument of “The Need for Biases in Learning Generalizations” [38],
we mitigate the challenges of existing data hungry methods via
inductive bias while learning GANs. We show that effective inductive
bias can be provided by humans in the form of domain
knowledge [
            <xref ref-type="bibr" rid="ref20 ref7">14, 27, 41, 50</xref>
            ]. Rich human advice can effectively balance
the impact of the quality (sparsity) of the training data. Data quality
also contributes to the well-studied modal instability of GANs. This
problem is especially critical in domains such as medical/clinical
analytics that do not typically exhibit ‘spatial homophily’ [
            <xref ref-type="bibr" rid="ref14">21</xref>
            ],
unlike images, and are prone to distributional diversity among feature
clusters as well. Our human-guided framework proposes a robust
strategy to address this challenge. Note that in our setting the human
is an ally and not an adversary.
          </p>
          <p>
            The second limitation of access is crucial for medical data
generation. Access to existing medical databases [
            <xref ref-type="bibr" rid="ref11 ref3">10, 18</xref>
            ] is hard due to
cost and access concerns and thus synthetic data generation holds
tremendous promise [
            <xref ref-type="bibr" rid="ref12 ref6">6, 13, 19, 35, 48</xref>
            ]. While previous methods
generated synthetic images, we go beyond images and generate
clinical data. Building on this body of work, we present a synthetic data
generation framework that effectively exploits domain expertise to
handle data quality.
          </p>
          <p>We make a few key contributions:
(1) We demonstrate how effective human advice can be provided
to a GAN as an inductive bias.
(2) We present a method for generating data given this advice.
(3) We demonstrate the effectiveness and efficacy of our
approach on 2 de-identified clinical data sets. Our method
is generalizable to multiple modalities of data and is not
necessarily restricted to images.
(4) Yet another feature of this approach is that training occurs
from very few data samples (&lt; 50 in one domain), thus
providing human guidance as a data generation alternative.</p>
          <p>
            2 RELATED WORK
The key principle behind GANs [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ] is a zero-sum game [
            <xref ref-type="bibr" rid="ref19">26</xref>
            ] from
game theory, a mathematical representation where each participant’s
gain or loss is exactly balanced by the losses or gains of the other
participants and is generally solved by a minimax algorithm. The
generator distribution p_g(x) over the given data x is learned by
sampling z from a random distribution p_z(z) (initially uniform was
proposed but Gaussians have been proven superior [2]). While GANs
have proven to be a powerful framework for estimating generative
distributions, the convergence dynamics of the naive minimax algorithm
have been shown to be unstable. Some recent approaches, among
many others, augment learning either via statistical relationships
between true and learned generative distributions such as Wasserstein-1
distance [3], MMD [
            <xref ref-type="bibr" rid="ref25">32</xref>
            ], or via spectral normalization of the
parameter space of the generator [39], which keeps the generator
distribution from drifting too far. Although these approaches have improved
GAN learning in some cases, there is room for improvement.
          </p>
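          <p>For reference, the zero-sum game described above is usually written as the minimax objective below, and the Wasserstein-1 distance mentioned above has a standard dual form. This is the textbook formulation, not notation taken from this paper: D (discriminator), G (generator), p_data, p_z, p_g and the 1-Lipschitz critic f are conventional symbols assumed here for illustration.</p>

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

W_1(p_{\mathrm{data}}, p_g)
  = \sup_{\|f\|_L \le 1} \;
    \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[f(x)\big]
  - \mathbb{E}_{x \sim p_g}\big[f(x)\big]
```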
          <p>
            Guidance via human knowledge is a provably effective way to
control learning in presence of systematic noise (which leads to
instability). One typical strategy to incorporate such guidance is
by providing rules over training examples and features. Some of
the earliest approaches are explanation-based learning (EBL-NN,
[49]) or ANNs augmented with symbolic rules (KBANN, [50]).
Various widely-studied techniques of leveraging domain knowledge
for optimal model generalization include polyhedral constraints in
case of knowledge-based SVMs [
            <xref ref-type="bibr" rid="ref2 ref21 ref7">9, 14, 28, 47</xref>
            ], preference rules
[
            <xref ref-type="bibr" rid="ref20">5, 27, 41, 42</xref>
            ] or qualitative constraints (e.g., monotonicities/synergies
[54] or quantitative relationships [
            <xref ref-type="bibr" rid="ref8">15</xref>
            ]). Notably, whereas these
models exhibit considerable improvement with the incorporation of
human knowledge, there is only limited use of such knowledge in
training GANs. Our approach resembles the qualitative constraints
framework in spirit.
          </p>
          <p>
            While widely successful in building optimally generalized models
in presence of systematic noise (or sample biases), knowledge-based
approaches have mostly been explored in the context of
discriminative modeling. In the generative setting, a recent work extends
the principle of posterior regularization from Bayesian modeling to
deep generative models in order to incorporate structured domain
knowledge [
            <xref ref-type="bibr" rid="ref15">22</xref>
            ]. Traditionally, knowledge based generative learning
has been studied as a part of learning probabilistic graphical models
with structure/parameter priors [36]. We aim to extend the use of
knowledge to the generative model setting.
          </p>
          <p>3 KNOWLEDGE INTENSIVE LEARNING OF GENERATIVE ADVERSARIAL NETWORKS
A notable disadvantage of the adversarial training formulation is that
training is slow and unstable, leading to mode collapse [2], where
the generator starts generating data of only a single modality. This
has resulted in GANs not being exploited to their full potential in
generating synthetic non-image clinical data. Human advice can
encourage exploration in diverse areas of the feature space and help
learn more stable models [43]. Hence, we propose a human-allied
GAN architecture (HA-GAN) (Figure 1). The architecture
incorporates human advice in the form of feature correlations. Such intrinsic
relationships between the features are crucial in medical data sets
and are thus a natural candidate for additional knowledge/advice
in guided model learning for faithful data generation.</p>
          <p>
            Our approach builds upon a GAN architecture [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ] where a
random noise vector is provided to the generator which tries to generate
examples as close to the real distribution as possible. The
discriminator tries to distinguish between real examples and ones generated
by the generator. The generator tries to maximize the probability
that the discriminator makes a mistake and the discriminator tries to
minimize its mistakes thereby resulting in a min-max optimization
problem which can be solved by a mini-max algorithm. We adopt
the Wasserstein GAN (WGAN) architecture [
            <xref ref-type="bibr" rid="ref13">3, 20</xref>
            ], which focuses
on defining a distance/divergence (the Wasserstein or earth mover's
distance) to measure the closeness between the real distribution and the
model distribution. Throughout, we use ‘GAN’ to indicate ‘W-GAN’.
          </p>
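          <p>To make the earth mover's distance concrete, the following is a minimal sketch for the special case of two equal-size one-dimensional samples, where the distance reduces to the mean absolute difference between sorted values. It is an illustration only: a WGAN estimates the distance with a learned critic rather than by sorting, and the function name and toy samples are hypothetical.</p>

```python
def wasserstein_1d(real, generated):
    """Earth mover's (Wasserstein-1) distance between two equal-size 1-D samples."""
    if len(real) != len(generated):
        raise ValueError("this sketch assumes samples of equal size")
    if not real:
        raise ValueError("samples must be non-empty")
    # For equal-size empirical distributions on the line, the optimal
    # transport plan pairs the i-th smallest real value with the
    # i-th smallest generated value.
    pairs = zip(sorted(real), sorted(generated))
    return sum(abs(r - g) for r, g in pairs) / len(real)


real_values = [1.0, 2.0, 3.0, 4.0]
close_sample = [4.5, 1.5, 3.5, 2.5]   # same shape, shifted by 0.5
far_sample = [11.0, 12.0, 13.0, 14.0]  # same shape, shifted by 10

print(wasserstein_1d(real_values, close_sample))
print(wasserstein_1d(real_values, far_sample))
```

          <p>A generated sample that sits closer to the real one yields a smaller distance, which is the signal the generator is trained to reduce.</p>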
          <p>Human input as inductive bias
Historically, two approaches have been studied for using guidance
as bias. The first is to provide advice on the labels as constraints
or preferences that control the search space. Some example advice
rules on the labels include: (3 ≤ feature1 ≤ 5) ⇒ label = 1 and (0.6
≤ feature2 ≤ 0.8) ∧ (4 ≤ feature3 ≤ 5) ⇒ label = 0. Such advice
is more relevant in a discriminative setting but is not ideal for
GANs: since GANs are known to be sensitive to the training data
and the labels themselves are being generated, the labels should not be
altered during training. The second is via correlations between features as
preferences (our approach), which allows for a faithful representation
of diverse modalities.</p>
          <p>Advice injection: After every fixed number of iterations, N, we
calculate the correlation matrix of the generated data G1 and provide
a set of advice  on the correlations between different features.
Consider the following motivating example for the use of correlations as
a form of advice.</p>
          <p>Example: Consider predicting heart attack with 3 features:
cholesterol, blood pressure (BP) and income. The values of the given
features can vary (sometimes widely) between different patients due
to several latent factors (e.g., smoking habits). It is difficult to assume
any specific distribution. In other words, it is difficult to deduce
whether the values for the features come from the same distribution
(even though the feature values in the data set are similar).</p>
          <p>We modify the correlation coefficients (for both positive and
negative correlations) between the features, increasing them if the
human advice suggests that two features are highly correlated and
decreasing them if the advice suggests otherwise.</p>
          <p>Example: Continuing the above example, since a rise in the
cholesterol level can lead to a rise in BP and vice versa, expert advice here
can suggest that cholesterol and BP should be highly correlated.</p>
          <p>Also, as income may not contribute directly to BP and cholesterol
levels, another piece of advice here can be to de-correlate cholesterol/BP
and income level.</p>
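          <p>The advice in this running example can be sketched in a few lines of Python: compute the correlation matrix of the generated data, then scale the advised entries up or down while keeping them within [-1, 1]. This is a sketch under stated assumptions, not the paper's update rule: the helper names, the toy data and the fixed augmentation factor of 1.2 are all hypothetical.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation matrix as a dict keyed by (feature, feature) pairs."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def apply_advice(corr, advice, factor=1.2):
    """Scale advised correlations up ("up") or down ("down"), clipped to [-1, 1].

    `factor` is a hypothetical augmentation factor >= 1, not a value from the paper.
    """
    adjusted = dict(corr)
    for (a, b), direction in advice.items():
        scale = factor if direction == "up" else 1.0 / factor
        value = max(-1.0, min(1.0, corr[(a, b)] * scale))
        adjusted[(a, b)] = adjusted[(b, a)] = value  # keep the matrix symmetric
    return adjusted


# Toy "generated" data for the three features of the running example.
generated = {
    "cholesterol": [180.0, 200.0, 220.0, 240.0, 260.0],
    "bp": [118.0, 125.0, 121.0, 135.0, 131.0],
    "income": [30.0, 42.0, 41.0, 55.0, 60.0],
}
advice = {
    ("cholesterol", "bp"): "up",
    ("cholesterol", "income"): "down",
    ("bp", "income"): "down",
}
corr = correlation_matrix(generated)
target = apply_advice(corr, advice)
for pair in advice:
    print(pair, round(corr[pair], 3), "->", round(target[pair], 3))
```

          <p>Multiplying by a factor of at least 1 strengthens a correlation regardless of its sign, and dividing by the same factor weakens it, which matches the increase/decrease reading of the advice.</p>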
          <p>The example advice rules are: 1. Correlation(“cholesterol
level”, “BP”)↑, 2. Correlation(“cholesterol level”, “income level”)↓
and 3. Correlation(“BP”, “income level”)↓, where ↑ and ↓ indicate
increase and decrease respectively. Based on the first advice rule, we need
to increase the correlation coefficient between cholesterol level and
BP. Then</p>
          <p>(1)
Here C is the correlation matrix, A is the advice matrix and  is the
factor by which the correlation value is to be augmented. In case
where we need to increase the value of the correlation coefficient, 
should be &gt; 1. We keep  = 1( | C |) . Since -1.0 ≤ ∀ ∈ C ≤ 1.0,
in this case, the value of  ≥ 1.0, leading to enhanced correlation via
          </p>
          <p>
\[
F(X_k, E_k, \alpha) =
\underbrace{\alpha_1 \sum_{E_i \in X_k} I(E_i; Y)}_{\text{max. relevance}}
- \underbrace{\Big( \alpha_2 \sum_{E_j \in X_k} I(E_i; E_j)
- \alpha_3 \sum_{E_j \in X_k} I(E_i; E_j \mid Y) \Big)}_{\text{min. redundancy}}
- \underbrace{\alpha_4 \sum_{E_i \in X_k} C(E_i)}_{\text{cost penalty}}
\qquad (1)
\]
where I(E_i; Y) is the mutual information between the random
variable E_i (feature) and Y (target). In the above equation, the feature
subset X_k is grown greedily, maximizing the above objective function.
In Equation 1, E_i denotes
a single feature from the elicitable set E that is considered for
evaluation based on the subset X_k grown so far. The first term is the
mutual information between each feature and the class variable Y.
In a discriminative task, this value should be maximized. The
second term is the pairwise mutual information between each feature
to be evaluated and the features already added to the feature subset
X_k. This value needs to be minimized for selecting informative
features. The third term is called the conditional redundancy [1] and
this term needs to be maximized. The last term adds the penalty
for the cost of every feature and ensures the right trade-off between
cost, relevance and redundancy. For this work, we do not learn the
parameters α for each cluster and instead fix these parameters to 1.</p>
          <p>We leave the learning of these parameters to future work.</p>
          <p>In the problem setup, since the 0 cost feature subset is always
present, we always consider the observed feature subset O in
addition to the most important feature subset as returned by the
Feature selector objective. We also account for the knowledge of
the observed features while growing the informative feature subset
through greedy optimization. Specifically, while calculating the
pairwise mutual information between the features and the
conditional redundancy term (second and third term of equation 1), we
also evaluate the mutual information of the features with these
observed features. It is to be noted that in cases where the observed
features are not discriminative enough of the target, the feature
selector module ensures that the elicitable features with maximum
relevance to the target variable are picked.</p>
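          <p>The greedy, information-theoretic selection described above can be sketched as follows. This is a minimal illustration assuming discrete features, all weights fixed to 1 as stated in the text, and a per-candidate score of relevance minus redundancy minus cost, with redundancy also measured against the observed features. The function names, toy data, cost values and budget are hypothetical and do not come from the authors' implementation.</p>

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(X; Y) in nats for two equal-length sequences of discrete values."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


def conditional_mi(xs, ys, zs):
    """I(X; Y | Z): mutual information inside each stratum of Z, weighted by P(Z)."""
    n = len(zs)
    total = 0.0
    for z, count in Counter(zs).items():
        idx = [i for i in range(n) if zs[i] == z]
        total += (count / n) * mutual_information(
            [xs[i] for i in idx], [ys[i] for i in idx]
        )
    return total


def candidate_score(name, known, features, target, costs):
    """Relevance minus (pairwise minus conditional) redundancy minus cost."""
    relevance = mutual_information(features[name], target)
    redundancy = sum(
        mutual_information(features[name], features[k])
        - conditional_mi(features[name], features[k], target)
        for k in known
    )
    return relevance - redundancy - costs[name]


def greedy_select(features, target, observed, costs, budget):
    """Add elicitable features while the best score is positive and the budget holds."""
    selected, spent = [], 0.0
    candidates = [name for name in features if name not in observed]
    while candidates:
        affordable = [c for c in candidates if budget >= spent + costs[c]]
        if not affordable:
            break
        known = list(observed) + selected
        best = max(
            affordable,
            key=lambda c: candidate_score(c, known, features, target, costs),
        )
        if 0.0 >= candidate_score(best, known, features, target, costs):
            break  # no remaining feature adds information worth its cost
        selected.append(best)
        spent += costs[best]
        candidates.remove(best)
    return selected


features = {
    "gait": [0, 0, 1, 1, 0, 1, 0, 1],         # observed, zero cost
    "tremor": [0, 0, 1, 1, 0, 1, 1, 0],       # informative about the target
    "tremor_copy": [0, 0, 1, 1, 0, 1, 1, 0],  # exact duplicate, fully redundant
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],        # unrelated to the target
}
target = [0, 0, 1, 1, 0, 1, 1, 0]
costs = {"tremor": 0.1, "tremor_copy": 0.1, "noise": 0.1}
chosen = greedy_select(features, target, observed=["gait"], costs=costs, budget=0.25)
print(chosen)
```

          <p>In this toy setting the duplicate feature is rejected once its twin has been selected, because its pairwise redundancy outweighs its relevance, and the uninformative feature never earns back its cost.</p>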
          <p>Optimization Problem: The cost aware feature selector
F(X_k, E_k, α) for a given set of instances E_k belonging to a specific
cluster k solves the following optimization problem:</p>
          <p>\[ X_k = \operatorname{argmax}_{X_k \subseteq E} F(X_k, E_k, \alpha) \qquad (2) \]</p>
          <p>For a given instance (x, y), we denote by L(x, y, X_k, Θ) the loss
function using a subset X_k of the features as obtained from the
feature selector optimization problem. The optimization problem
for learning the parameters of a classifier can be posed as:</p>
          <p>\[ \min_{\Theta} \sum_{i=1}^{n} L(x_i, y_i, X_k, \Theta) + \lambda_1 C(X_k) + \lambda_2 \lVert \Theta \rVert^2 \qquad (3) \]
where λ1 and λ2 are hyper-parameters and X_k is the feature subset of
the cluster to which the instance belongs. In the above equation, Θ
is the parameter of the model and can be updated by standard
gradient-based techniques. This loss function takes into account the
important feature subset for each cluster and updates the parameter
accordingly. The classifier objective also includes a cost term,
denoted by C(X_k), to account for the cost of the selected feature
subset. For a hard budget on the elicited features, the cost component
in the model objective can be considered. In the case of a cost budget,
this component can be ignored because the elicited feature subset
adheres to a fixed cost and hence this term is constant.
</p>
          <p>3.3 Algorithm
We present the algorithm for Cost Aware Feature Elicitation
(CAFE) in Algorithm 1. CAFE takes as input a set of training examples
E, the zero-cost feature set O, the elicitable feature subset E, a
real-valued cost vector C and a budget B. Each element in the training set E
consists of a tuple (x, y), where x is the real-valued feature vector and y
is the label.</p>
          <p>The training instances E are clustered based on just the observed
feature set O using K-means clustering (Cluster). For every cluster
k, the training instances belonging to the cluster are assigned to
the set E_k and passed to the FeatureSelector module (lines 6-8).</p>
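          <p>The clustering step can be illustrated with a plain K-means routine that looks only at the observed columns and then groups the full instances by cluster. This is a sketch under assumptions: the deterministic initialization (first k points), the fixed iteration count, the helper names and the toy rows are all hypothetical, and a library implementation of K-means would normally be used instead.</p>

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centers = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_by_cluster(instances, observed_idx, k):
    """Cluster on the observed columns only and group full instances per cluster."""
    observed = [[row[i] for i in observed_idx] for row in instances]
    labels = kmeans(observed, k)
    groups = {c: [] for c in range(k)}
    for row, label in zip(instances, labels):
        groups[label].append(row)
    return groups


# Columns: two observed features, then one elicitable feature.
instances = [
    [0.1, 0.2, 5.0],
    [4.0, 4.1, 1.0],
    [0.0, 0.3, 6.0],
    [4.2, 3.9, 0.5],
]
groups = split_by_cluster(instances, observed_idx=[0, 1], k=2)
print({c: len(rows) for c, rows in groups.items()})
```

          <p>Each group can then be handed to a per-cluster feature selector, so that instances that look alike on the zero-cost features share one elicited feature subset.</p>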
          <p>The FeatureSelector function takes E_k, the parameter α, the feature
subsets O and E, the cost vector C and a predefined budget B as input
and returns the most important feature subset X_k corresponding
to a cluster k. A greedy optimization technique is used to grow
the feature subset X_k of every cluster based on the feature selector
objective function defined in Equation 1. The FeatureSelector
terminates once the budget B is exhausted or the mutual
information score becomes negative. Once the important feature subsets
are obtained for all the |K| clusters, the model objective function is
optimized as mentioned in Equation 3 for all the training instances,
using the important feature subsets of the clusters to which the
training instances belong (lines 12-18). All the remaining features
are imputed using either 0 or any other imputation model
before training the model. The final training model G(E_{O∪X}, Θ, α)
is a unified model used to make predictions for a test instance
consisting of just the observed feature subset O.
</p>
          <p>4 EMPIRICAL EVALUATION
We ran experiments on 3 real-world medical data sets. The
intuition behind CAFE is most natural in medical domains, hence our
choice of data sets. However, the idea can be applied to other
domains, ranging from logistics to resource allocation tasks. Table 2
lists the various features of the data sets used in our
experiments. Below are the details of the 3 real data sets we use for our
experiments.</p>
          <p>
            1. Parkinson’s disease prediction: The Parkinson’s Progression
Marker Initiative (PPMI) [
            <xref ref-type="bibr" rid="ref5">12</xref>
            ] is an observational study where the
aim is to identify Parkinson’s disease progression from various
types of features. The PPMI data set consists of various features
related to various motor functions and non-motor behavioral and
psychological tests. We consider certain motor assessment features
like rising from a chair, gait, freezing of gait, posture and postural
stability as observed features and all remaining features as elicitable
features, which must be acquired at a cost.
          </p>
          <p>Algorithm 1 Cost Aware Feature Elicitation
1: function CAFE(E, O, E, C, B)
2: E = E_{O∪E} ⊲ E consists of 0 cost features O and costly features E
3: K = Cluster(E_O) ⊲ Clustering based on the observed features O
4: X = {∅} ⊲ Stores best feature subsets of each cluster
5: for k = 1 to |K| do ⊲ Repeat for every cluster
6: E_k = GetClusterMember(E, K, k)
7: ⊲ Get the data points belonging to each cluster k
8: X_k = FeatureSelector(E_k, α, O, E, C, B)
9: ⊲ Parameterized feature selector for each cluster
10: X = X ∪ {X_k ∪ O}
11: end for
12: for k = 1 to |K| do ⊲ Repeat for every cluster
13: X_k = GetFeatureSubset(X, k)
14: ⊲ Get the feature subset for each cluster k
15: for i = 1 to |E_k| do ⊲ Repeat for every data point in cluster k
16: Optimize L(x_i, y_i, X_k, Θ, C)
17: ⊲ Optimize the objective function in Equation 3
18: Update Θ ⊲ Update the model parameter Θ
19: end for
20: end for
return G(E_{O∪X}, Θ, α) ⊲ G is the training model built on E
21: end function</p>
          <p>
2. Alzheimer’s disease prediction: The Alzheimer’s Disease
Neuroimaging Initiative (ADNI)¹ is a study that aims to test whether various
clinical, fMRI and biomarker features can be used to predict the early onset
of Alzheimer’s disease. In this data set, we consider the
demographics of the patients as observed, zero-cost features and the fMRI
image data and cognitive score data as unobserved, elicitable
features.
3. Rare disease prediction: This data set is created from survey
questionnaires [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ] and the task here is to predict whether a person
has a rare disease or not. The demographic features are observed,
while other sensitive questions in the survey regarding technology
use, health and disease-related meta information are considered to
be elicitable.
          </p>
          <p>Evaluation Methodology: All the data sets were partitioned
into an 80:20 train-test split. Hyperparameters such as the number of
clusters on the observed features were picked by 5-fold cross
validation on all the data sets. The optimal number of clusters
picked was 6 for ADNI, 9 for the Rare disease data set and 7 for the
PPMI data set. For the results reported in Table 1, we considered a
hard budget on the number of elicitable features and set it to half
of the total number of features in the respective data set. We use
K-means clustering as the underlying clustering algorithm. For all the
reported results, we use an underlying Support Vector Machine [3]
classifier with a radial basis function kernel. Since all the data sets
are highly imbalanced, we consider metrics like recall, F1,
AUC-ROC and precision for our reported results. For the feature
selector module, we used the existing implementation of Li et al. [7]
and built upon it. We consider two variants of CAFE: (1) CAFE, in
which we replace the missing and unimportant features of every
cluster with 0 and then update the classifier parameters, and (2) CAFE-I,
where we replace the missing and unimportant features by using an
imputation model learnt from the already acquired feature values
of other data points. A simple imputation model is used where we
replace the missing features with the mode for categorical features and
the mean for numeric features.
¹ www.loni.ucla.edu/ADNI</p>
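          <p>The simple imputation model used by CAFE-I (mode for categorical features, mean for numeric ones) can be sketched as below. The representation of missing values as None, the function names and the toy rows are assumptions made for this illustration only.</p>

```python
from collections import Counter


def fit_imputer(rows, categorical):
    """Learn one fill value per column: mode if categorical, mean if numeric.

    Missing entries are represented by None (an assumption of this sketch),
    and every column is assumed to have at least one acquired value.
    """
    fill = {}
    for col in rows[0]:
        seen = [row[col] for row in rows if row[col] is not None]
        if col in categorical:
            fill[col] = Counter(seen).most_common(1)[0][0]
        else:
            fill[col] = sum(seen) / len(seen)
    return fill


def impute(row, fill):
    """Return a copy of `row` with every None replaced by the learned value."""
    return {col: fill[col] if value is None else value for col, value in row.items()}


# Feature values already acquired for other data points.
acquired = [
    {"sex": "F", "score": 20.0},
    {"sex": "F", "score": 30.0},
    {"sex": "M", "score": None},
    {"sex": None, "score": 25.0},
]
fill = fit_imputer(acquired, categorical={"sex"})
print(impute({"sex": None, "score": None}, fill))
```

          <p>Replacing the fill dictionary with zeros for every column gives the plain CAFE variant, which makes the two variants easy to compare on the same pipeline.</p>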
          <p>Baselines: We consider 3 baselines for evaluating CAFE and
CAFE-I: (1) using only the observed, zero-cost features to update
the training model, denoted as OBS; (2) using a random subset of a
fixed number of elicitable features and all the observed features
to update the training model, denoted as RANDOM (for this baseline,
the results are averaged over 10 runs); and (3) using the information
theoretic feature selector score as defined in Equation 1 to select
the ‘k’ best elicitable features on the entire data, without any cluster
consideration, along with the observed features, denoted as KBEST.</p>
          <p>We keep the value of ‘k’ the same as that used by CAFE.</p>
          <p>Although some of the existing methods could be potential baselines,
none of these methods match the exact setting of our problem, hence
we do not compare our method against them.</p>
          <p>Results: We aim to answer the following questions:
Q1: How do CAFE and CAFE-I with a hard budget on features
compare against the standard baselines?
Q2: How do the cost-sensitive versions of CAFE and CAFE-I
fare against the cost-sensitive baseline KBEST?</p>
          <p>The results reported in Table 1 suggest that both CAFE and
CAFE-I significantly outperform the other baselines in almost all the
metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and
CAFE-I outperform the other baselines in the clinically relevant recall
metric, while KBEST performs best on the other metrics. The
reason is that in ADNI the elicitable features are
image features, and we discretize them to calculate
the information gain for the feature selector module; the granular
feature information is lost in this discretization,
hence the drop in performance. For the experiments in Table 1,
we keep the budget at approximately half of the total number
of features for all the methods. On average, CAFE-I performs
better than CAFE across all the data sets because the underlying
imputation model treats the missing
values better than replacing all such features by 0. This answers Q1
affirmatively.</p>
          <p>In Figure 3, we compare the cost version of CAFE and CAFE-I
against KBEST. The cost version takes into account the cost of
individual features and accounts for them as a penalty in the feature selector
module. Hence, in this version of CAFE, a cost budget is used as
opposed to a hard budget on the number of elicitable features. We
generate the cost vector by sampling each cost component uniformly
from (0,1). For PPMI and Rare disease, we can see that cost-sensitive
CAFE performs consistently better than KBEST with increasing
cost budget. In the PPMI data set, the greedy optimization of the
feature selector objective on the entire data set led to the elicitation of
just 1 feature; beyond that the information gain was negative, hence
the performance of KBEST on PPMI remains the same across cost
budgets. CAFE, on the other hand, was able to select important feature
subsets for various clusters based on the observed features related
to gait and posture. For the ADNI data set, CAFE performs better than
KBEST only in recall. The reason for this is the same as mentioned
above. This helps in answering Q2 affirmatively.</p>
          <p>Lastly, Figure 2 shows the effect of increasing the number of clusters on the
validation recall for the Rare disease data set. As can be seen, for
a smaller number of clusters, the recall is very low, and it increases to
an optimum at 9 clusters. This suggests
that forming clusters based on observed important features helps
CAFE select different feature subsets for different clusters,
thus helping the learning procedure.
5 CONCLUSION
In this paper, we pose the prediction-time feature elicitation problem
as an optimization problem by employing a cluster-specific feature
selector to choose the best feature subset and then optimizing the
training loss. We show the effectiveness of our approach on real data
sets where the problem setup is intuitive. Future work includes
learning the parameters of the feature selector module, jointly
optimizing the feature selector and model parameters for a more
robust framework, and adding more constraints to the optimization.</p>
          <p>ACKNOWLEDGEMENTS
SN &amp; SD gratefully acknowledge the support of NSF grant
IIS-1836565. Any opinions, findings and conclusions or
recommendations are those of the authors and do not necessarily reflect the
view of the US government.</p>
          <p>A New Delay Differential Equation Model for COVID-19</p>
          <p>B Shayak†
Mechanical and Aerospace Engg</p>
          <p>Cornell University
Ithaca, New York State, USA
sb2344@cornell.edu</p>
          <p>Retarded logistic equation</p>
          <p>Mohit M Sharma
Population and Health Sciences</p>
          <p>Weill Cornell Medicine</p>
          <p>New York City, USA
mos4004@med.cornell.edu</p>
          <p>Manas Gaur</p>
          <p>AI Institute
University of South Carolina</p>
          <p>USA
mgaur@email.sc.edu
ABSTRACT
In this work we give a delay differential equation, the retarded
logistic equation, as a mathematical model for the global
transmission of COVID-19. This model accounts for asymptomatic
carriers, pre-symptomatic or latent transmission as well as contact
tracing and quarantine of suspected cases. We find that the
equation admits varied classes of solutions including self-burnout,
progression to herd immunity and multiple states in between. We
use the term “partial herd immunity” to refer to these states,
where the disease ends at an infection fraction which is not
negligible but is significantly lower than the conventional herd
immunity threshold. We believe that the spread of COVID-19 in
every localized area can be explained by one of our solution
classes.</p>
          <p>CCS CONCEPTS
• Applied computing – mathematics and statistics
KEYWORDS
Retarded logistic equation, Asymptomatic carriers, Latent
transmission, Contact tracing, Reproduction number calculation,
Partial herd immunity
1 Introduction</p>
          <p>Three kinds of models to study COVID-19 are currently in
vogue – lumped parameter or compartmental models (ordinary
differential equation), agent-based models and stochastic
differential equation models. The first option affords maximum
conceptual clarity at the expense of some simplifying assumptions
(homogeneous mixing etc). The second option affords maximum
potential versatility at the cost of huge computational complexity
and variability in the network structure. The third option
combines features of the previous two – whether the features
being synergized are the positive or the negative ones depends to
a large extent on the modeler.</p>
          <p>†Presenting author, Corresponding author. ORCID : 0000-0003-2502-2268
In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the
Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San
Diego, California, USA, August 24, 2020. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).</p>
          <p>KiML'20, San Diego, California, USA
© 2020, Copyright held by the author(s).</p>
          <p>In this work we use delay differential equations (DDE) to
propose a simple, single-variable, lumped parameter model for the
spread of Coronavirus. Jahedi and Yorke [1] make a strong case
for simpler models relative to complex and elaborate ones. In the
literature, DDEs have been used for modeling COVID-19, for
example in Refs. [2]–[4]. These authors however ignore features
such as contact tracing, asymptomatic carriers and latent
transmission; our results too have a richer structure.
2 Derivation of the model</p>
          <p>We measure time t in days and use as our basic variable y(t)
which is the cumulative number of corona cases, including active
cases, recovered cases and deaths, in the region of interest. The
following “word-equation” summarizes the approach:
\[ \bigl(\text{Rate of emergence of new cases}\bigr) = \bigl(\text{Interaction rate of each existing case}\bigr) \times \bigl(\text{Probability of transmission}\bigr) \times \bigl(\text{Number of existing cases}\bigr) \tag{0} \]</p>
          <p>The left hand side (LHS) here is just dy/dt whereas the right
hand side (RHS) needs a detailed derivation.</p>
          <p>Equation (0) assumes that the disease is transmitted from
infected to susceptible people via interaction, and not via airborne
transmission. Due to asymptomatic and pre-symptomatic carriers,
there are always cases moving about in society who are oblivious
to their infectivity. Each such case interacts with other people at
a different rate. For example, a working-from-home professor
might venture outside once every three days and interact with one
person on each trip while a grocer might go to work and interact
with 10 customers every day. The professor has an interaction rate
of 1/3 persons/day while the grocer has interaction rate of 10
persons/day. For a compartmental model, one must average over
the professor, the grocer and all the other un-quarantined cases to
generate an effective per-case interaction rate q0.</p>
          <p>
            Of course, not every interaction results in a transmission;
there is a probability strictly less than unity that the virus jumps
from the infected person to the person with whom s/he is interacting.
This probability has two components. The first component
is that the healthy person must be susceptible to begin with. While
we ignore intrinsic insusceptibles, there will be people who have
recovered from the disease and are therefore not susceptible
again. In this Article, we assume that one bout of infection brings
permanent immunity. The assumption is valid so long as the
immunity period exceeds the total epidemic duration. To date,
there is little credible evidence for re-infection [5]–[7]; on the contrary,
a very recent and thorough study [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ] based on monitoring of a large
patient cohort has found significant evidence of long-lasting and
effective antibodies. If N is the initial number of susceptible
people (recall that y is the case count), then the probability that a
random person is a recovered case is approximately y/N and the
probability that s/he is susceptible is (approximately) 1−y/N. This
expression is approximate because the true number of recovered
cases at any time is less than y; the error however is small since
the recovery period is much shorter than the overall course of the
epidemic. Note that 1−y/N is a logistic term, and a herd immunity
effect.
          </p>
          <p>Given susceptibility, the next probability is that the virus
actually does jump from the un-quarantined case to the
susceptible person. This probability depends on the level of
precaution such as face covering or mask, handwashing and
disinfection being adopted by the case as well as the susceptible
person. For a compartmental model, the probability must be
averaged over all the un-quarantined cases. If this average
probability is P0, then q0(1−y/N)P0 gives the per-case spreading
rate. Since q0 and P0 are both dependent on public health
measures, and are both difficult to measure independently, we can
club those two together into a single parameter which we call m0.</p>
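          <p>As a minimal sketch of the averaging described above, the following Python snippet forms an effective interaction rate q0 from the professor and grocer rates quoted in the text, then combines it with P0 into m0 and evaluates the per-case spreading rate m0(1−y/N). The equal weighting of the two cases and the values of P0, y and N are hypothetical and serve only as an illustration.</p>

```python
# Interaction rates (persons/day) quoted in the text.
interaction_rates = {"professor": 1.0 / 3.0, "grocer": 10.0}


def effective_rate(rates):
    """Unweighted mean interaction rate over the listed cases (assumed weighting)."""
    return sum(rates.values()) / len(rates)


def per_case_spreading_rate(q0, p0, y, n_susceptible):
    """m0 * (1 - y/N), with m0 = q0 * P0 as defined in the text."""
    m0 = q0 * p0
    return m0 * (1.0 - y / n_susceptible)


q0 = effective_rate(interaction_rates)  # (1/3 + 10) / 2
P0 = 0.05                               # hypothetical transmission probability
rate = per_case_spreading_rate(q0, P0, y=30000.0, n_susceptible=300000.0)
print(f"q0 = {q0:.3f} persons/day, spreading rate = {rate:.4f} per day")
```

          <p>In practice q0 and P0 cannot be measured separately, which is why the text works with the single lumped parameter m0.</p>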
          <p>So far we have accounted for the rate at which each case
spreads the disease; now we have to count the number of cases
out of quarantine. Let us start with an asymptomatic carrier, who
remains in open society throughout. S/he typically transmits the
disease for 7 days, which is called the infection period. Then, new
healthy people can only be infected by those asymptomatic
cases who have fallen sick within the last 7 days, and not those
who have fallen sick earlier. The number of such people is the
number of asymptomatic sick people today minus the number of
those 7 days earlier. Mathematically, let μ1 (between 0 and 1)
denote the fraction of asymptomatic carriers and τ1 the
asymptomatic infection period. Then, the number of
asymptomatic transmitters today is μ1(y(t)−y(t−τ1)). Here we can
see the emergence of the delay term.</p>
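          <p>The delay term can be illustrated on a daily record of cumulative cases. In the sketch below, τ1 = 7 days is the infection period quoted above, while the case record and the value μ1 = 0.8 are hypothetical.</p>

```python
def asymptomatic_transmitters(cumulative, day, mu1, tau1):
    """mu1 * (y(t) - y(t - tau1)) for a daily cumulative-case list.

    Days before the start of the record are treated as zero cases.
    """
    earlier = cumulative[day - tau1] if day >= tau1 else 0.0
    return mu1 * (cumulative[day] - earlier)


# Hypothetical record: 100 new cases every day, starting from zero.
cumulative_cases = [100.0 * d for d in range(15)]
count = asymptomatic_transmitters(cumulative_cases, day=10, mu1=0.8, tau1=7)
print(count)  # 0.8 * (1000 - 300)
```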
          <p>The remaining fraction 1−μ1 of cases are symptomatic. Let τ2
be the latency period during which these cases remain
transmissible prior to displaying symptoms. It is assumed that
they isolate themselves thereafter. We also assume that
the incubation period is equal to the latency period. Finally, the
contact tracing drive conducted by the public health department is
taken into account. We assume that this drive is
instantaneous and proceeds in the forward direction starting from
freshly arriving symptomatic cases. The contact trace captures
patients who were exposed to the new case τ2 days ago, as well as
patients who were exposed immediately before the new case
manifested symptoms. The average duration for which these
secondary patients have remained at large is τ2/2, be they
symptomatic or asymptomatic. The assumption of instantaneous
contact tracing, which decreases the average time that
contact-traced cases spend out of quarantine, opposes the error arising
from the assumption of zero non-transmissible incubation period,
which increases the average time for which the contact-traced
cases transmit before quarantine. These two effects are assumed
here to cancel. Let μ3 (between 0 and 1) denote the fraction of all
cases who escape from contact tracing drives; the
complementary fraction 1−μ3 get caught. Thus, we have three
classes of un-quarantined cases: (a) 1−μ3 are contact-traced cases
who remain in society for a time τ2/2, (b) μ3(1−μ1) are untraced
symptomatic cases who go into isolation only after time τ2, and (c)
μ3μ1 are undetected asymptomatic cases who transmit for the
entire infection period τ1. Arguments similar to those of the
previous paragraph yield the total number of un-quarantined
cases as
\[ n = (1-\mu_3)\bigl(y - y(t-\tau_2/2)\bigr) + (1-\mu_1)\mu_3\bigl(y - y(t-\tau_2)\bigr) + \mu_1\mu_3\bigl(y - y(t-\tau_1)\bigr). \tag{1} \]</p>
          <p>The preceding arguments now yield the mathematical form of
(0) as
\[ \frac{dy}{dt} = m_0\left(1-\frac{y}{N}\right)\Bigl[\, y(t) - (1-\mu_3)\,y(t-\tau_2/2) - (1-\mu_1)\mu_3\,y(t-\tau_2) - \mu_1\mu_3\,y(t-\tau_1) \Bigr], \tag{2} \]
which is the retarded logistic equation.
3 Solutions of the model</p>
          <p>Due to the complexity of equation (2), an analytical solution using
perturbation theory etc has not been attempted in this case.</p>
          <p>
            Instead we have used numerical integration to obtain the
solutions of (2). Before giving the solutions however, we present
the calculation of the reproduction number R. To find R at any
state of evolution of the disease, we first treat y in the logistic term
to be constant, and then carry out the steps described in Ref. [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ].
          </p>
          <p>This yields the expression</p>
          <p>
value as per our knowledge [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ]–[
            <xref ref-type="bibr" rid="ref6">13</xref>
            ]), τ1=7 days and τ2=3 days
[
            <xref ref-type="bibr" rid="ref7">14</xref>
            ]. The initial condition needs to be a function spanning
the maximum delay involved in the problem, which is seven
days; we take this function to be zero cases to start with and
a constant increase of 100 cases/day for a week.
          </p>
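          <p>The text does not say which numerical scheme was used. As a minimal sketch, the retarded logistic equation (2) can be integrated with a fixed-step forward Euler method that keeps the whole solution history in memory. The values m0 = 0.23, μ3 = 1/2, τ1 = 7 and τ2 = 3 are those quoted in the text for Notional City A. The values μ1 = 0.8 and N = 300,000 are assumptions, not stated parameters; they were chosen because they are consistent with the R values reported for the notional cities. The history is taken as y = 100t over the first week and zero before it.</p>

```python
def simulate(m0, mu1, mu3, tau1, tau2, N, days, dt=0.01):
    """Forward-Euler sketch of the retarded logistic equation (2).

    Returns y on a grid of spacing dt starting at t = 0. The first tau1 days
    (one week here) hold the prescribed history of 100 new cases per day.
    """
    def steps(duration):
        return int(round(duration / dt))

    k_half, k2, k1 = steps(tau2 / 2.0), steps(tau2), steps(tau1)
    y = [100.0 * i * dt for i in range(k1 + 1)]  # history: y = 100 t

    def past(i, k):
        # y(t - delay); zero cases before the start of the record
        return y[i - k] if i >= k else 0.0

    for _ in range(steps(days)):
        i = len(y) - 1
        delayed = ((1.0 - mu3) * past(i, k_half)
                   + (1.0 - mu1) * mu3 * past(i, k2)
                   + mu1 * mu3 * past(i, k1))
        rate = m0 * (1.0 - y[i] / N) * (y[i] - delayed)
        y.append(y[i] + dt * rate)
    return y


# City A values from the text; mu1 and N are assumed, not stated.
y_city_a = simulate(m0=0.23, mu1=0.8, mu3=0.5, tau1=7.0, tau2=3.0,
                    N=300000.0, days=300.0)
print(f"final case count: {y_city_a[-1]:.0f} "
      f"({100.0 * y_city_a[-1] / 300000.0:.2f}% of N)")
```

          <p>A production calculation would more likely use a dedicated DDE solver with error control; the fixed step here is only meant to make the role of the three delays explicit.</p>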
          <p>
            Notional City A has m0=0.23 and μ3=1/2, which describes a
hard lockdown [
            <xref ref-type="bibr" rid="ref8">15</xref>
            ] accompanied by good contact tracing. R0 (i.e.
(3) evaluated at y=0) is 0.886. The epidemic ends with a negligible
fraction of infected people, as shown below. This and the next five
plots are three-way: each plot shows y as a blue line, its derivative
dy/dt as a green line and the weekly increments in cases, or
epidemiological curve, as a grey bar chart. These last have been
reduced by a factor of 7 to ensure clarity of presentation. We
report the rates on the left hand side y-axis and the cumulative
cases on the right hand side y-axis.
          </p>
          <p>Figure 1 : City A extinguishes the epidemic in time.</p>
          <p>This is exactly what has happened in New Zealand: that truly
most fortunate of countries has indeed quashed the epidemic
completely, with the final case count being a negligible fraction of
its total (tiny and sparsely distributed) population.</p>
          <p>The parameter values for Notional City B are the same as those
for A except that μ3=0.75; a greater fraction of cases escape the
contact tracing drive. R0 is 1.16, and R becomes 1 at y=40500 cases.</p>
          <p>Figure 2 : City B grows at first before reaching burnout.</p>
          <p>The symbol ‘k’ denotes thousand.</p>
          <p>The outbreak enters an exponential regime right after being
released. As y increases, R gradually reduces so the growth slows
down until it peaks when the case count is about 39,000 [compare
with the value of 40,500 when R=1 as per (3)]. Thereafter, the
disease progresses to extinction in time. The overall progression
is very long but one hopes that the relatively small size of the peak
can prevent overstressing of medical care facilities and thus avoid
unnecessary deaths. Delhi and Mumbai in India and Los Angeles
in the USA are in all probability cities of this type, since the disease
there spiraled out of control despite hard lockdowns being
imposed at an early stage.</p>
          <p>City B also enables us to explain partial herd immunity. Even
though the initial conditions were unfavourable for containment
of the epidemic, herd immunity started activating as the disease
proliferated. A stable zone (R&lt;1) was entered when only 13.5
percent of the total susceptible population was infected, and a
similar percentage again got infected before the epidemic ended.</p>
          <p>
            Thus, herd immunity worked in synergy with
nonpharmaceutical interventions to stop the epidemic at only 26
percent infection level, which is significantly less than the
conventional 70-90 percent threshold [
            <xref ref-type="bibr" rid="ref9">16</xref>
            ]. This is what we call
partial herd immunity. Our findings are in agreement with, and act
as an explanation for, what has been obtained by Britton et al. [
            <xref ref-type="bibr" rid="ref10">17</xref>
            ]
and Peterson et al. [
            <xref ref-type="bibr" rid="ref11">18</xref>
            ].
          </p>
          <p>We now consider Notional City C which differs from City B in
that m0=0.5; lockdown is replaced by a much more permissive
state. R0 is above 2.5; 180,000 infections are required to bring R
below unity.</p>
          <p>Figure 3 : City C goes to herd immunity – total not
partial. The symbol ‘k’ denotes thousand and ‘L’ hundred
thousand.</p>
          <p>Needless to say, this is a public health disaster. Notional
City D combines features of B and C. This city begins with m0=0.5
like City C but reduces to m0=0.23 like City B when the case count
reaches 40,000 (the R=1 threshold for B’s parameters).</p>
          <p>Figure 4 : As the input, so the output – D’s response
combines features of B and C. The symbol ‘k’ denotes
thousand and ‘L’ hundred thousand.</p>
          <p>We can see a case count as well as a total duration intermediate
between those of B and C; the epidemic is over in 70 days but the peak rate of
12,920 cases/day is still very high and likely to load hospital
facilities beyond their carrying capacity.</p>
          <p>The Cities E and F demonstrate the issues faced in reopening.</p>
          <p>In both these cities, the parameters and case trajectory are
identical to those of City A for the first 80 days. Then, E and F
reopen on the 80th day by increasing m0 from 0.23 to 0.5, and
simultaneously decreasing μ3 i.e. deploying a more effective
contact tracing program which had been built up during the
lockdown. The post-reopening μ3’s for E and F are 0.1 and 0.2
respectively.</p>
          <p>Figure 5 : City E, like City A, is a success story.</p>
          <p>Figure 6 : Unlike City E, F is a failure story. The symbol
‘k’ denotes thousand and ‘L’ hundred thousand.</p>
          <p>The difference between Cities E and F is dramatic.</p>
          <p>Mathematically, R remained less than unity throughout in E; its
value after reopening was 0.985. We can see that the case rate
decreases monotonically all the time. In F, the post-reopening R
became 1.22 and sent the trajectory haywire. In practice however,
the incipient increase in case rate after the 80th day acts as an
advance warning of what has happened – the reopening steps
should be reversed if it is at all possible to do so while satisfying
economic and other external constraints.</p>
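          <p>The expression (3) for R is not legible in this text. As a cross-check, the sketch below uses one assumed form, R = m0(1−y/N)[(1−μ3)τ2/2 + (1−μ1)μ3τ2 + μ1μ3τ1], which multiplies the per-case spreading rate by the mean time each class of case spends at large. This form, together with the inferred values μ1 = 0.8 and N = 300,000 (neither of which is stated in the text), is consistent with the R values quoted for Cities A, B, C, E and F. It should be read as a reconstruction under these assumptions rather than as the authors' formula.</p>

```python
def reproduction_number(m0, mu1, mu3, tau1, tau2, y=0.0, N=300000.0):
    """Assumed form of (3): m0 * (1 - y/N) * mean time at large per case."""
    mean_time_at_large = ((1.0 - mu3) * tau2 / 2.0
                          + (1.0 - mu1) * mu3 * tau2
                          + mu1 * mu3 * tau1)
    return m0 * (1.0 - y / N) * mean_time_at_large


def threshold_cases(r0, N=300000.0):
    """Case count at which R falls to 1, from R0 * (1 - y/N) = 1."""
    return N * (1.0 - 1.0 / r0)


MU1, TAU1, TAU2 = 0.8, 7.0, 3.0  # MU1 is inferred; TAU1, TAU2 are from the text
cities = {  # (m0, mu3) pairs quoted in the text
    "A": (0.23, 0.5),
    "B": (0.23, 0.75),
    "C": (0.5, 0.75),
    "E after reopening": (0.5, 0.1),
    "F after reopening": (0.5, 0.2),
}
r0 = {name: reproduction_number(m0, MU1, mu3, TAU1, TAU2)
      for name, (m0, mu3) in cities.items()}
for name, value in r0.items():
    print(f"City {name}: R = {value:.3f}")
print(f"City B reaches R = 1 near y = {threshold_cases(r0['B']):.0f}")
```

          <p>The values for E and F are evaluated at y = 0, which is a reasonable approximation only because the case count at reopening is a negligible fraction of N.</p>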
          <p>Conclusion</p>
          <p>
            In this Article we have presented a new mathematical model
for COVID-19 which is simple and elegant in structure but can
generate a variety of realistic solution classes. We hope that our
work may be of use to mathematicians and data scientists who are
trying to understand the spread of the disease in a quantitative
manner. The public health implications of these results are being
reserved for another study.
          </p>
          <p>Public Health Implications of a Delay Differential Equation Model for COVID-19</p>
          <p>Mohit M Sharma
Population and Health Sciences</p>
          <p>Weill Cornell Medicine</p>
          <p>New York City, USA</p>
          <p>B Shayak
Sibley School of Mechanical and</p>
          <p>Aerospace Engineering</p>
          <p>Cornell University</p>
          <p>Ithaca, New York State, USA
ABSTRACT</p>
          <p>This paper describes the strategies derived from a novel delay
differential equation model [1], signifying a practical extension
of our recent work. COVID-19 is an extremely ferocious and
unpredictable pandemic which poses unique challenges for
public health authorities, on account of which “case races”
among various countries and states do not serve any purpose and
present delusive appearances while ignoring significant
determinants. We aim to propose comprehensive planning
guidelines as a direct implication of our model. Our first
consideration is reopening, followed by effective contact tracing
and ensuring public compliance. We then discuss the
implications of the mathematical results on people’s behavior
and eventually provide conclusive points aimed at strengthening
the arsenal of resources that are helpful in framing public health
policies. Knowledge about the pandemic and its association with
public health interventions is documented in various
literature-based sources. In this study, we explore those resources
to explain the findings inferred from the delay differential equation
model of COVID-19.</p>
          <p>KEYWORDS
1 INTRODUCTION</p>
          <p>The national (USA) and global spread of Coronavirus Disease
2019 (COVID-19), following its origins in Wuhan, China no later
than December 2019 and possibly earlier still [2], has been
alarmingly rapid and deadly. The 25 individual national
forecasts received by the CDC predict that
the total reported COVID-19 deaths may be between 160,000 and
175,000 by August 15th, 2020 [3]. Some features however, both
nationally and globally, have proved counterintuitive. For
example, a 76-day lockdown resulted in the outbreak’s
containment in Wuhan. A similar measure has produced similar
results in New Zealand. However, lockdown appeared only
marginally effective in New York State, USA where the case and
death counts decreased only after reaching horrifying peak levels
[4]. It was contended that the stay at home order in New York
came too late. This apparent delay was not present in California,
USA. The case counts there went up all the same, and the rate is
high even today. We would like to mention that such
spatiotemporal anomalies are present not just in the US but also
in other countries such as Canada, Russia and India [5] which
witnessed high case growth despite being in lockdown. In order
to better understand the epidemiology of the transmission of
COVID-19, we have constructed a delay differential equation
model. Here we present its practical implications, which try to
encapsulate a myriad of factors associated with the current
scenario.
2 MATHEMATICAL MODELING TO
UNDERSTAND THE EPIDEMIOLOGY</p>
          <p>For many decades, mathematical modelling has been used
as an integral tool in recognizing the trend of disease progression
during pandemics. For example, using a simple model of
the transmission dynamics of an infectious disease between the
susceptible, infected and recovered populations (SIR epidemic
models), Kermack and McKendrick proposed and later
established a principle: the level of susceptibility in the
population should be adequately high in order for an epidemic
to unfold in that population. Such mathematical models can give
useful insights in explaining the epidemiological status
of the population, and can predict or calculate the transmissibility of the
pathogen and the potential impact of public health preventive
practices [6]. However, a significant body of evidence
suggests that the choice of parameters
to be included should be contingent on their impact on the precision of
predictions. Several policy questions about the containment of
this outbreak have been considered in our recently proposed
simple non-linear model [1]. This paper delves into the practical
solutions that can be devised using the directions indicated by our
model’s outcomes.</p>
          <p>In generating interpretable results gathered from
epidemiological models, we have used the examples of six types
of cities [1]:</p>
          <p>1) City A – Moderately effective contact
tracing in a hard lockdown. This city has R
(reproductive number) &lt;1 and drives the epidemic to
extinction in time.</p>
          <p>2) City B – Less effective contact tracing in a
hard lockdown. It starts off with R &gt;1, but reaches R =1 at
15% infection level. The epidemic ends at 30%
infection level and takes a very long time to get there.</p>
          <p>3) City C – Less effective contact tracing (like
City B) with milder restriction on mobility. It proceeds
rapidly to herd immunity.</p>
          <p>4) City D – Combination of City B and City C.
Starts with mild restriction on mobility and progresses
towards stricter restriction. Both the duration of the epidemic and
the final case count lie between those of Cities B
and C.</p>
          <p>5) City E – Starts off like City A, then reopens with
very effective contact tracing and drives the epidemic to
extinction in time.</p>
          <p>6) City F – Starts off like City A, then reopens
with less effective contact tracing and suffers a second
wave.</p>
          <p>Pragmatic implications of our work are as follows:
3 REOPENING CONSIDERATIONS, ROLE
OF TESTING</p>
          <p>The unemployment situation generated as a result of
lockdowns is currently forcing countries and states to partially
reopen their economies even though many of them have not yet
got the virus under control. The reopening is easiest in City A
regions where cases have slowed down to a trickle. With every
new case being detected, swift isolation of all potential
secondary, tertiary and maybe even quaternary cases, both
forward and backward, should prove possible while the rest of
the economy functions in a relatively uninhibited way. Even one
mass transmission event can restart an exponential growth
regime and force a rollback to a fully locked down state.</p>
          <p>Reopening beyond a skeletal level is impossible in City B regions
which are still in the ascending phase. The ascent implies that
contact tracing is already inadequate, and if
mobility increases on top of that then the region might turn into City C,
overstress healthcare systems, and suffer large-scale loss of life. An
ascending B-City has little option other than to contact trace as
hard as possible and wait for partial herd immunity to kick in.
Only when that happens and the cases slow down on their own
can it consider a more extensive reopening like a City A region.</p>
          <p>
            Testing is no doubt an important part of the epidemic management
process, since it enables the authorities to get an accurate
description of the spread of the disease. As we have already
discussed, limited testing capacity is giving us a partial or
distorted picture in many regions. There is a widespread media
perception that extensive testing is one of the prerequisites for
any kind of reopening process [7], [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ]. Much criticism has also
been levelled at certain countries for having inadequate testing
programs (we shall further elaborate the blame aspects later).
          </p>
          <p>However, we would like to emphasize that testing is as yet a
diagnostic tool and not a preventive one. Currently, it can show
us how the disease is behaving but cannot slow its spread in any
way. Test-induced slowing can come only when the capacity
expands to such a level as to be able to preventively test potential
super-spreaders such as grocers and food workers every single
day. We hope that such a development may prove possible in the
near future; many universities, for example, are making
reopening arrangements with provision for very frequent testing
of the entire community.</p>
          <p>
            During reopening it is vital to get a true picture of the disease
evolution so that we can gauge the effect of any relaxation of
restrictions – whether it keeps the outbreak under control as in
City E or brings about the beginnings of a second wave as in City
F. Such beginnings are heralded by a rise in the case rate. As we
saw, there was no such rise in City E even though R increased
after the reopening. If the rise takes place, the relaxation must
immediately be rolled back to avert the disaster. Hence, during
reopening, the testing capacity must be high enough to detect
such incipient rises. As per China’s state media reports, with an
aim to reopen the economy, the city of Wuhan conducted 6
million tests in one week; we present this fact without discussion
or comment. A second reason why testing is still not all that it
could have been is the high false-negative rate during the initial
stages of infection [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ]. Suppose a contact tracing drive identifies
Mr X as a potential case, having been exposed to a known case
yesterday. Then, it can be that Mr X contracts the virus ten days
from now, in which situation he will report negative if tested
today or tomorrow, but will still amount to a spreading risk ten
days later if he is at large then. This also means that secondary
contact tracing, i.e. finding Mr. X’s contacts, must go ahead
irrespective of his test results. Indeed, the medical authorities are
well aware of this loophole.
          </p>
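          <p>The rollback criterion above depends on detecting an incipient rise in the case rate. One simple way to flag such a rise, sketched below, compares the mean of the most recent week of daily cases with that of the preceding week. The one-week window, the 10 percent tolerance and the sample records are hypothetical choices for illustration and are not part of the model in [1].</p>

```python
def case_rate_rising(daily_cases, window=7, tolerance=0.10):
    """True if the latest window's mean exceeds the previous window's mean
    by more than the given relative tolerance."""
    if 2 * window > len(daily_cases):
        raise ValueError("need at least two full windows of data")
    recent = sum(daily_cases[-window:]) / window
    previous = sum(daily_cases[-2 * window:-window]) / window
    return recent > previous * (1.0 + tolerance)


# Hypothetical daily new-case records over two weeks.
declining = [50, 48, 47, 45, 44, 42, 40, 39, 37, 36, 35, 33, 32, 30]
rebound = [30, 29, 28, 28, 27, 26, 26, 27, 29, 32, 35, 39, 44, 50]
print(case_rate_rising(declining), case_rate_rising(rebound))
```

          <p>Such a check is only as good as the testing behind the daily counts, which is the point made above about testing capacity during reopening.</p>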
          <p>The US Chamber of Commerce has given out state by state
reopening guides for small businesses which are mandated to be
followed across the US. Continued following of federal, state,
tribal, territorial and local recommendations is of paramount
importance.</p>
          <p>Prior to resuming work, all workplaces should have a
carefully charted exposure control, mitigation and recovery
plan. Although essential guidance is specific to each business,
there are certain measures that can be generally adopted across
all workplaces.</p>
          <p>1) Reopening in phases – The US government has laid down
guidelines to open the country in 3 phases. The first phase requires
vulnerable individuals to continue to remain at home. When
in public, people are expected to wear masks, maintain maximum
physical separation, avoid places with more than 10 people and
limit non-essential travel. The second phase allows gatherings of 50
people, some non-essential travel and reopening of schools. The third
phase involves relaxation of restrictions, permitting vulnerable
populations to operate.</p>
          <p>2) Defining new metrics – The post-corona world will witness
some significant changes in regulatory controls, and behavioral
drift in personal and professional spheres. Cleanliness standards,
safety standards, and infection prevention practices, with regular
monitoring and inspection for their assurance, are some of the new
terms that will have to be a part of the daily life of the people for
at least the next few months.</p>
          <p>3) Organizational changes – To help essential operations to
function, companies and organizations will have to be prepared
with advanced IT systems (in case of continuation of remote
working), supply of PPE, travel facilities that avoid
public transport, and behavioral health services, and leave
no stone unturned in overcoming biological, physical, and
emotional challenges. We can see that the above guidelines are
broadly consistent with our model predictions.
4 METHODS OF CONTACT TRACING</p>
          <p>
            As we have already mentioned, contact tracing is probably the
single most important factor in determining the progression of
COVID-19 in a region. We can see from the model that the faster
the contact tracing takes place, the better; the more delay we
have, the higher R becomes. Moreover, our model does not
account for backward contact tracing. In practice however, a
sufficiently high level of detection might not be possible to
achieve with forward contact tracing alone. As much as it is
important, contact tracing is also one of the trickiest aspects to
handle since it can interfere with people’s privacy. In classical
contact tracing, human tracers talk to the confirmed cases and
track down their movements as well as the persons they
interacted with over the past couple of days. This method has
worked well in Ithaca, USA and in Kerala, India. While it is the
least invasive of privacy, it is also the most unreliable since
people might not remember their movements or their interactions
correctly. The time taken in this method is also the maximum. A
more sophisticated variant of this supplements human testimony
with CCTV footage and credit/debit card transaction histories –
this approach is possible only in countries such as the USA where
card usage predominates over cash. The most sophisticated
contact tracing algorithms use artificial intelligence together with
location-tracking mobile devices and apps – while they are quick
and fool-proof, they automatically raise issues of privacy and
security. For example, the TraceTogether app in Singapore,
which worked very well during the initial phases of the outbreak,
has not found popularity with many users [
            <xref ref-type="bibr" rid="ref3">10</xref>
            ]. Similarly, India’s
Aarogya Setu has also raised privacy concerns [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ]. Americans
too have expressed their aversion to using contact tracing apps in
a recent poll, with only 43 percent of people saying that they
trusted companies like Google or Apple with their data.
5 ENSURING SOCIAL COMPLIANCE – A
BEHAVIORAL PERSPECTIVE
          </p>
          <p>As the epidemic drags on and on, the continued restrictions
on social activity are becoming more and more unbearable. There
is an increasing tendency, especially among younger people who
are much less at risk of serious symptoms, to violate the
restrictions and spread the disease through irresponsible actions.</p>
          <p>However, as in City F, a rise in violator behavior can completely
nullify the effects of lockdown over the past few weeks or
months. Here we discuss how public health professionals and
policy makers can resort to behavioral/psychological theories to
ensure compliance among the common people. The most widely
used model is the Health Belief Model, which has been used
successfully in addressing public health challenges. We briefly
discuss the utility of this model in the current situation.</p>
          <p>The Health Belief Model is a theoretical model which hypothesizes
that interventions will be most effective if they target the key factors
that influence health behaviors: perceived susceptibility,
perceived severity, perceived benefits, perceived barriers to
action, exposure to factors that prompt action, and
self-efficacy. In general, this model can be used to design short- and
long-term interventions. The prime components of this model
which are relevant in the current scenario can be outlined as
follows.</p>
          <p>1) Conducting a health need assessment to determine the
target population – The best example is the demarcation of zones
in India depending on the level of risk. A red zone is highest risk,
an orange zone is average risk and a green zone means no
cases in the last 21 days. The classification of districts is multifactorial, taking
into account the incidence of cases, the doubling rate, the
extent of testing and surveillance feedback.</p>
          <p>2) Communicating the consequences involved with risky
behaviors in a transparent manner – Central and state ministers
as well as public health authorities are in constant
communication with the masses.</p>
          <p>3) Conveying information about the steps involved in
performing the recommended action and focusing on the benefits
to action – Famous celebrities, in addition to state and central
governments, spread the messages, explaining the required steps
cogently and ensuring that they have the maximum reach, especially
among social media-addicted millennials and similar
populations.</p>
          <p>4) Being open about the issues/barriers, identifying them at
an early stage and working toward resolution – Activating all sorts
of helpline numbers, email addresses, personal offices etc. to
address any grievances around the topic.</p>
          <p>5) Developing skills and providing assistance that encourages
self-efficacy and the possibility of positive behavior change –
Adequate arrangements for people from lower socio-economic
strata, stable and trustworthy financial schemes for the middle class,
plans to support small businesses and a means to become a bridge
between the affluent class and the needy class are some of the
ways to foster positive behavior change and develop natural trust.</p>
          <p>Besides the Health Belief Model, some other theories that can be useful
are:</p>
          <p>Theory of Reasoned Action – This theory implies that an
individual’s behavior is based on the outcomes which the
individual expects as a result of such behavior. In a practical
scenario, if the health officials want the people to follow a
particular trend, let us say based on our model, they need to
reinforce the advantages of the targeted behavior and strategically
address the barriers. For instance, to enforce separation minima
even when they are apparently proving ineffective and the cases are
increasing, they can use the examples of Cities B and C to
convince the citizens that violations – and hence violators – can
be responsible for thousands of excess deaths.</p>
          <p>Trans-theoretical
Model – This model posits that any health behavior change
entails progress through six stages of change: precontemplation,
contemplation, preparation, action, maintenance and
termination. For instance, it was observed that in March, despite
a rise in cases in New York City (NYC), people were not
observing social restrictions the way they should have. Now, we
can see that with passing time, the behavior of the masses
transforms according to the stages of this model:</p>
          <p>Precontemplation – This is a stage where people are
typically not cognizant of the fact that their behavior is
troublesome and may cause undesirable consequences. There is
a long way to go before an actual behavior change. This phase
coincides with the commencement of cases in NYC.</p>
          <p>Contemplation – Recognition of the behavior as problematic
begins to surface and a shift begins towards behavior change.
When the cases started being reported all over the media and the
major cause of spread began to surface, citizens started paying
attention to their activities.</p>
          <p>Preparation – People start taking small steps toward
behavior change such as, in our case, adopting hygienic practices
and maintaining six-foot separation minima.</p>
          <p>Action – This stage covers the phase where people have just
changed their behavior and have positive intention to maintain
that approach. In this instance, people continue to practice social
restrictions and hygiene positively.</p>
          <p>Maintenance – This stage focuses on maintenance and
continuity of the adopted approach. The majority of people in
NYC are exhibiting positive behavior and maintaining it
throughout the reopening phases. This is vitally
important to ensure that NYC stops at partial herd immunity like
City D instead of blowing up again like City C.</p>
          <p>Termination – There is a lack of motivation to return to
the unhealthy behaviors, and some sections of people across the
country/world will continue practicing good hygiene (though not
social restrictions!) in their day-to-day lives.</p>
          <p>Social Ecological theory – This theory highlights the multiple
levels of influence that mold a decision. In our case, let us
say for example that the decision is to maintain sufficient
physical separation once offices are opened up. To successfully
follow this, a complex interplay between individual,
relationship, community and societal factors comes into
action. Law enforcement authorities need to take this into
consideration. A group of individuals, when motivated by one
another to follow the guidelines, builds a good connection within
the society, and in turn there is a high probability of building a
healthy network within a defined area. A negative interplay at
different levels of motivation may, in turn, prove disastrous and
cause all efforts to go down the drain. A perfect illustration of this
in the present condition is how various NGOs are working in
conjunction with public health authorities to bring about a change
at an individual level by door-to-door campaigning. This steers
the behavior of even the most potentially recalcitrant population
in the most desirable direction, i.e. wearing masks and gloves,
adopting hand hygiene, being cognizant of symptoms arising in
any member of the family and following quarantine rules in case
of travel from other states.</p>
          <p>6 SOCIAL ATTITUDES AND BEHAVIOUR</p>
          <p>
            In this Section we address another important issue related to
the coronavirus: the widely heterogeneous case
profiles in different regions have often led to “corona contests”
among these regions. Far too often, the residents of better-off
regions are seen heaping scorn on worse-hit regions. We have
selected a handful of representative media articles
castigating the approaches of India, the USA and Sweden to show
the breadth and vitriol of such commentary [
            <xref ref-type="bibr" rid="ref5">12</xref>
            ][
            <xref ref-type="bibr" rid="ref6">13</xref>
            ][
            <xref ref-type="bibr" rid="ref7">14</xref>
            ]
[
            <xref ref-type="bibr" rid="ref8">15</xref>
            ][
            <xref ref-type="bibr" rid="ref9">16</xref>
            ]. A feature common to almost all opinion pieces like this
is that their authors do not have the slightest knowledge of the
issues involved, either epidemiological or economic.
          </p>
          <p>Before embarking on criticisms, we should note that policy
decisions need to be taken in real time, as the situation evolves.
The authorities do NOT have the benefit of hindsight to decide
on their course of action. Since the virus is a new one, there is no
precedent which can act as a model. Even among emerging
infectious diseases, this latest one is particularly unpredictable,
since minuscule changes in parameters can cause dramatic
changes in the system’s behavior. This phenomenon is best
illustrated by the notional cities, discussed previously. For
example, to get from City A to B, all we did was increase by 50
percent the fraction of people who escaped the contact-tracers’
net. The result was a 30 times (not 30 percent!) increase in the
total number of cases. Similarly, the difference between Cities B
and D is an 11-day delay (recall that the first seven days in the
plots are the seeding period, so they don’t count) in imposing the
lockdown in D. 11 days out of a 200-plus-day run might not
sound like a lot. But, that was enough to create tens of thousands
of additional cases, risk overstressing healthcare systems and at
the same time shorten the epidemic duration by a factor of three.</p>
          <p>Further uncertainty comes from the fact that the parameter
values are changing constantly. It is well known that the
reported fraction of asymptomatic carriers has increased
continuously over the last three months or so. Considering the
sensitivity of this or any other model to parameter values, such
changes can completely invalidate the results of a model as well
as any decision which was made on their basis. Identifying
potential exposures is much easier in a smaller city than a large
or densely populated one. It is also more effective if the cases are
mostly from the sophisticated social class who can use mobile
phone contact tracing apps or otherwise keep (at least mental)
records of their movements and of the people they interacted
with. However, if there is an outbreak among the unsophisticated
class, then even the most skillful contact tracer might run up
against a wall of zero or false information. In such cases the
authorities are left with limited options.</p>
          <p>
            India went into lockdown on 25 March 2020. At that time, the
official figures stated that there were only 571 cases, which made
the decision appear premature to many people. Indeed, a
seven-day delay of lockdown was suggested so that the migrant workers
would be able to return to their homes. However, when
the lockdown was imposed, the testing had also been woefully
inadequate, with a nationwide total of just 22,694 tests having
been conducted up to that date. If we use the extrapolation
technique of inferring case counts from death counts, then using
the same 1 percent mortality rate and 20 day interval to death, we
find almost 40,000 assumed cases on the day that the lockdown
began. If we go by this figure, then the lockdown wasn’t really
early, and possibly should have been enforced earlier still in
trouble zones such as Mumbai. Certainly, if the figure of 40,000
cases is true, then one further week of normal life (with huge
crowds in trains and railway stations) might have been
disastrous. From the vantage point of today, alternative
arrangements should definitely have been made much earlier for
rehabilitation of the migrant workers. However, these
arrangements would have involved considerable complexity in
the prevailing situation, and were certainly not as easy as one
week’s delay in announcing lockdown. Sweden, which has
adopted a controlled herd immunity strategy, has been accused
of playing with fire. It is also possible that the Swedish
authorities are aware that they do not have the contact tracing
capacity required for performing like City A and hence are
attempting something like City D – a faster end of the epidemic
than City B at the expense of a higher case count. To make a
comprehensive analysis of their policy, it is crucial to know not
only the last intricate detail of the epidemiological aspects but
also the details of the economic considerations. That is almost
impossible. On a different note however, we have seen reports
[
            <xref ref-type="bibr" rid="ref10">17</xref>
            ], [
            <xref ref-type="bibr" rid="ref11">18</xref>
            ] stating that the virus has entered into old age homes
and similar establishments, causing hundreds of deaths over
there. Assuming that these reports are not overturned in the
course of time, allowing the ingress of virus into high-risk areas
is an indefensible action, whatever the overall epidemiological
strategy.
          </p>
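          <p>The extrapolation of case counts from death counts used above for the Indian lockdown can be written as a short calculation. The sketch below assumes, as in the text, a 1 percent mortality rate and a 20-day interval from case to death. The function name and the figure of 400 cumulative deaths are hypothetical: the latter is chosen only so that the arithmetic reproduces the order of magnitude quoted (about 40,000 assumed cases) and is not a reported statistic.</p>

```python
def inferred_cases(cumulative_deaths, day, lag_days=20, mortality_rate=0.01):
    """Estimate cumulative cases on `day` from cumulative deaths `lag_days` later."""
    if not mortality_rate > 0:
        raise ValueError("mortality_rate must be positive")
    # Cases on day t are approximated by deaths on day t + lag, divided by the rate.
    return cumulative_deaths[day + lag_days] / mortality_rate


# Hypothetical series: day 0 is the start of the lockdown, and 400 cumulative
# deaths are assumed (not reported data) on day 20.
hypothetical_deaths = {20: 400}
estimate = inferred_cases(hypothetical_deaths, day=0)
print(round(estimate))  # 40000
```

          <p>Because the estimate scales inversely with the assumed mortality rate, doubling that rate to 2 percent would halve the inferred case count, which is one reason such figures should be read as rough indications only.</p>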
          <p>
            Finally, extremely important public health factors such as the
racial dependence of susceptibility and/or transmissibility have just
started coming to the surface. Another complete grey area is the
mutations which this new and vicious virus is undergoing and what
effect they might have on the spreading dynamics. Some reports also
reflect that the change in genetic composition due to mutation might
be the reason behind huge differences in the crude infection rate
between countries [
            <xref ref-type="bibr" rid="ref12">19</xref>
            ][
            <xref ref-type="bibr" rid="ref13">20</xref>
            ]. In the absence of a clear picture about
this, any public health measure is all the more likely to be a random
guess with non-zero probabilities of both success and failure. Not
everything about corona is random or outside one’s control though.
          </p>
          <p>Amongst the European countries, we can see that Germany, Austria,
Switzerland, Denmark, Norway and Finland have definitely
managed the epidemic while their neighbors have not, which rules
out some hidden luck factor. The same has happened in Kerala and
Karnataka (also in India). This has been feasible only due to
governmental awareness and hard work, and people’s cooperation.</p>
          <p>Similarly, there are some governments which have been clearly
guilty of negligence or hubris in their management of the disease. It
would also be worthwhile to observe and draw lessons from some
of the places, such as Alabama, Arkansas, Florida and Texas, which
have recently been identified as potential hotspots of this pandemic.</p>
          <p>Lastly, our conclusion best resonates with the message that
coronavirus is not some kind of race but a public health disaster and
we should adopt a unified approach to the fight against it.</p>
          <p>CONCLUSION</p>
          <p>Here, we summarize the take-home messages from this paper:</p>
          <p>• A city can reopen only if it is past the peak of cases.
Reopening must be accompanied by robust contact tracing. The
US CDC has laid down a set of reopening guidelines which are
compatible with our model and its solutions.</p>
          <p>• Incorporating socio-behavioral theories can help in
executing interventional strategies effectively.</p>
          <p>• Efficiency of contact tracing comes at the expense of
people’s privacy – balancing between the two is a delicate
optimization problem.</p>
          <p>• In some regions, restrictions such as masks and six-foot
separation minima must be maintained for a very long time to
come. The public health authorities can ensure compliance by
resorting to socio-behavioral theories/approaches.</p>
          <p>• In deploying advanced contact tracing techniques,
significant consideration has to be given to ensuring high
data security and laying down privacy regulations that are
convincing to the users.</p>
          <p>• Control the spread by swift identification and
isolation of cases, accompanied by tracing and quarantine for
at least 2 weeks.</p>
          <p>• Empowerment of individuals and communities by the
government, to facilitate efficient capacity building.</p>
          <p>• Multidisciplinary coordination and strong leadership to
mobilize communities and take quick decisions, coupled with
thoughtful development of operation plans, are likely to prove
considerably efficient in handling this pandemic to the best of
our capacity.</p>
          <p>[1] B. Shayak and M. M. Sharma, “Retarded
logistic equation as a universal dynamic model for the
spread of COVID-19,” medRxiv, p.
2020.06.09.20126573, 2020, doi:
10.1101/2020.06.09.20126573.</p>
          <p>[2] E. Okanyene, B. Rader, Y. L. Barnoon, L.
Goodwin, and J. S. Brownstein, “Analysis of hospital
traffic and search engine data in Wuhan China
indicates early disease activity in the Fall of 2019,”
Harvard, 2020, [Online]. Available:
http://nrs.harvard.edu/urn-3:HUL.InstRepos:42669767.</p>
          <p>[3] CDC, “Forecasting COVID-19 in the US,”
2020.
https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html.</p>
          <p>[4] “Microsoft coronavirus webpage.”
https://www.bing.com/covid.</p>
          <p>[5] “COVID-19 in India.” [Internet]. Available
from: https://www.covid19india.org/.</p>
          <p>[6] L. Star and S. Moghadas, “The Role of
Mathematical Modelling in Public Health Planning and
Decision Making,” Natl. Collab. Cent. Infect. Dis., vol.
(5)2, no. 2, pp. 285–299, 2010.</p>
          <p>[7] Livemint, “Many states are far short of
COVID-19 testing levels.”
https://www.statnews.com/2020/04/27/coronavirusmany-states-short-of-testing-levels-needed-forsafereopening/.</p>
          <p>
            [
            <xref ref-type="bibr" rid="ref1">8</xref>
            ] Harvard Business Review, “A Plan to
Safely Reopen the U.S. Despite Inadequate Testing.”
https://hbr.org/2020/05/a-plan-to-safely-reopen-the-us-despite-inadequate-testing.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ] S. Telles, S. K. Reddy, and H. R. Nagendra,
“Variation in False Negative Rate of RT-PCR Based
SARS-CoV-2 Tests by Time Since Exposure,” J.
Chem. Inf. Model., vol. 53, no. 9, pp. 1689–1699,
2019, doi: 10.1017/CBO9781107415324.004.</p>
          <p>
            [
            <xref ref-type="bibr" rid="ref3">10</xref>
            ] M. Lee, “Given low adoption rate of
TraceTogether, experts suggest merging with
SafeEntry or other apps,” Today, 2020.
https://www.todayonline.com/singapore/given-lowadoption-rate-tracetogether-experts-suggest-mergingsafeentry-or-other-apps.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ] A. Zargar, “Privacy, security concerns as
India forces virus-tracking app on millions,” CBS
News.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref5">12</xref>
            ] K. Bajpai, “Five lessons of COVID.”
Available from:
https://timesofindia.indiatimes.com/blogs/toieditpage/five-lessons-of-covid-factors-that-arenegative-for-india-are-having-greater-impact-thanmitigating-ones/.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref6">13</xref>
            ] K. Grimes, “Is politics the reason why Gov.
Newsom is keeping California locked down?,”
California Globe.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref7">14</xref>
            ] R. Guha, “What Modi got wrong on
COVID-19 and how he can fix it.”
https://www.ndtv.com/opinion/5-lessons-for-modi-oncovid-19-by-ramachandra-guha-2227259.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref8">15</xref>
            ] K. Weintraub, “Sweden sticks with
controversial covid approach,” [Online]. Available:
https://www.webmd.com/lung/news/20200501/sweden-sticks-with-controversial-covid19-approach.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref9">16</xref>
            ] The Island Now, “Cuomo has failed in his
handling of coronavirus.”
https://theislandnow.com/opinions-100/readers-writecuomo-has-failed-in-handling-of-coronavirus/.
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref11">18</xref>
            ] “What’s going wrong in Sweden’s care
homes.”
          </p>
          <p>
            [
            <xref ref-type="bibr" rid="ref12">19</xref>
            ] L. van Dorp et al., “Emergence of genomic
diversity and recurrent mutations in SARS-CoV-2,”
Infect. Genet. Evol., vol. 83, no. May, p. 104351, 2020,
doi: 10.1016/j.meegid.2020.104351.
          </p>
          <p>Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets</p>
          <p>Jitin Krishnan
Department of Computer Science
George Mason University
Fairfax, VA
jkrishn2@gmu.edu</p>
          <p>ABSTRACT</p>
          <p>State-of-the-art models for cross-lingual language understanding
such as XLM-R [7] have shown great performance on benchmark
data sets. However, they typically require some fine-tuning or
customization to adapt to downstream NLP tasks for a domain. In this
work, we study the unsupervised cross-lingual text classification task
in the context of the crisis domain, where rapidly filtering relevant data
regardless of language is critical to improving the situational awareness
of emergency services. Specifically, we address two research
questions: a) Can a custom neural network model over XLM-R, trained
only in English for such a classification task, transfer knowledge to
multilingual data and vice versa? b) By employing an attention
mechanism, does the model attend to words relevant to the task
regardless of the language? To this end, we present an attention
realignment mechanism that utilizes a parallel language classifier to
minimize any linguistic differences between the source and target
languages. Additionally, we pseudo-label the tweets from the target
language, which are then combined with the tweets in the source
language for retraining the model. We conduct experiments using
Twitter posts (tweets) labelled as a ‘request’ in the open source
data set by Appen (footnote 1), consisting of multilingual tweets for crisis
response. Experimental results show that attention realignment and
pseudo-labelling improve the performance of unsupervised
cross-lingual classification. We also present an interpretability analysis by
evaluating the performance of attention layers on original versus
translated messages.</p>
          <p>Social Media, Crisis Management, Text Classification,
Unsupervised Cross-Lingual Adaptation, Interpretability</p>
          <p>ACM Reference Format:
Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Attention
Realignment and Pseudo-Labelling for Interpretable Cross-Lingual
Classification of Crisis Tweets. In Proceedings of KDD Workshop on Knowledge-infused
Mining and Learning (KiML’20). , 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn</p>
          <p>Footnote 1: https://appen.com/datasets/combined-disaster-response-data/</p>
          <p>Hemant Purohit
Department of Information Sciences &amp; Technology
George Mason University
Fairfax, VA
hpurohit@gmu.edu</p>
          <p>Huzefa Rangwala
Department of Computer Science
George Mason University
Fairfax, VA
rangwala@gmu.edu</p>
          <p>1 INTRODUCTION</p>
          <p>Social media platforms such as Twitter provide valuable information
to aid emergency response organizations in gaining real-time
situational awareness during the sudden onset of crisis situations [4].</p>
          <p>
            Extracting critical information about affected individuals,
infrastructure damage, medical emergencies, or food and shelter needs
can help emergency managers make time-critical decisions and
allocate resources efficiently [
            <xref ref-type="bibr" rid="ref14 ref15 ref23 ref24 ref8">15, 21, 22, 30, 31, 36</xref>
            ]. Researchers
have designed numerous classification models to help towards this
humanitarian goal of converting real-time social media streams into
actionable knowledge [
            <xref ref-type="bibr" rid="ref15 ref19 ref21 ref22">1, 22, 26, 28, 29</xref>
            ]. Recently, with the advent
of multilingual models such as multilingual BERT [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ] and XLM
[
            <xref ref-type="bibr" rid="ref13">20</xref>
            ], researchers have started adapting them to multilingual
disaster tweets [
            <xref ref-type="bibr" rid="ref18">6, 25</xref>
            ]. Since XLM-R [7] has been shown to be the
strongest of these models in cross-lingual language understanding, we
restrict our work to this model to explore the aspects of cross-lingual
transfer of knowledge and interpretability.
          </p>
          <p>In this work, we address two questions. The first is whether
XLM-R is effective in capturing multilingual knowledge: we
construct a custom model over it and analyze whether a model trained
using English-only tweets generalizes to multilingual data and
vice versa. Social media streams generally differ from other
text because the content is user-generated. For example, tweets are
usually short and may contain errors and ambiguous behavioral
expressions. These properties make classification and
representation extraction more challenging. The second question
is whether word translations are attended to equally
by the attention layers. For instance, the words with higher
attention weights in a Haitian Creole sentence such as “Tanpri nou
bezwen tant avek dlo nou zon silo mesi” should align with the words
in its English translation, “Please, we need
tents and water. We are in Silo, Thank you!”. Our core idea is that if
‘dlo’ in the Haitian tweet has a higher weight, so should its English
translation ‘water’. This word-level language-agnostic property can
make machine learning models more interpretable. It
also benefits downstream tasks such as knowledge
graph construction using keywords extracted from tweets. In
situations where data is available in only one language, this similarity in
attention would still allow us to extract relevant phrases in
cross-lingual settings. To the best of our knowledge, aligning
attention in a cross-lingual setting has not been attempted
before in the crisis analytics domain. In this work, we restrict our
classification experiments to tweets containing ‘request’ intent; we plan to
expand to other behaviors, tasks, and datasets in the future.</p>
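          <p>The word-level alignment idea above can be illustrated with a small check: take the most attended words in the original tweet and see whether their translations are also among the most attended words in the translated tweet. The sketch below is only an illustration. The function names, the attention weights and the two-entry lexicon are hypothetical (the weights are invented for the example sentence, not produced by any model), and the agreement score is an assumed measure rather than the evaluation used in this paper.</p>

```python
def top_attended(tokens, weights, k):
    """Return the k tokens with the largest attention weights."""
    ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]


def attention_agreement(src_tokens, src_weights, tgt_tokens, tgt_weights, lexicon, k=2):
    """Fraction of the top-k source words whose translation is in the top-k target words."""
    src_top = top_attended(src_tokens, src_weights, k)
    tgt_top = set(top_attended(tgt_tokens, tgt_weights, k))
    hits = sum(1 for word in src_top if lexicon.get(word) in tgt_top)
    return hits / k


# Example sentence pair from the text; all weights below are made up.
ht_tokens = "tanpri nou bezwen tant avek dlo nou zon silo mesi".split()
ht_weights = [0.05, 0.05, 0.15, 0.25, 0.03, 0.30, 0.05, 0.04, 0.05, 0.03]
en_tokens = "please we need tents and water we are in silo thank you".split()
en_weights = [0.04, 0.03, 0.14, 0.24, 0.02, 0.32, 0.03, 0.02, 0.02, 0.08, 0.03, 0.03]

# Hypothetical word-level lexicon used only for this illustration.
lexicon = {"dlo": "water", "tant": "tents"}

score = attention_agreement(ht_tokens, ht_weights, en_tokens, en_weights, lexicon)
print(score)  # 1.0
```

          <p>A score of 1.0 here means that both highly attended Haitian Creole words map onto highly attended English words; a language-dependent attention layer would tend to score lower on such a check.</p>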
          <p>
            Contributions: We propose a novel attention realignment method
which promotes the task classifier to be more language agnostic,
which in turn tests the effectiveness of multilingual knowledge
capture of the XLM-R model for crisis tweets; and a pseudo-labelling
procedure to further enhance the model’s generalizability. Further,
incorporating the attention-based mechanism allows us to perform
an interpretability analysis on the model, by comparing how words
are attended in the original versus translated tweets.
          </p>
          <p>
            There are numerous prior works (cf. surveys [
            <xref ref-type="bibr" rid="ref7">4, 14</xref>
            ]) that focus
specifically on disaster related data to perform classification and
other rapid assessments during an onset of a new disaster event.
          </p>
          <p>
            The crisis period is an important but challenging situation, where
collecting labeled data during an ongoing event is very expensive. This
problem has led to several works on domain adaptation techniques in
which machine learning models can learn and generalize to unseen
crisis events [
            <xref ref-type="bibr" rid="ref11 ref16 ref3">3, 10, 18, 23</xref>
            ]. In the context of crisis data, Nguyen et al.
[
            <xref ref-type="bibr" rid="ref21">28</xref>
            ] designed a convolutional neural network model which does not
require any feature engineering and Alam et al. [1] designed a CNN
architecture with adversarial training on graph embeddings.
Krishnan et al. [
            <xref ref-type="bibr" rid="ref12">19</xref>
            ] showed that sharing a common layer for multiple
tasks can improve performance of tasks with limited labels.
          </p>
          <p>
            In multilingual or cross-lingual direction, many works [
            <xref ref-type="bibr" rid="ref1 ref10">8, 17</xref>
            ]
tried to align word embeddings (such as fastText [
            <xref ref-type="bibr" rid="ref20">27</xref>
            ]) from different
languages into the same space so that a word and its translations
have the same vector. These models are superseded by models such
as multilingual BERT [
            <xref ref-type="bibr" rid="ref2">9</xref>
            ] and XLM-R [7] that produce contextual
embeddings which can be pretrained using several languages
together to achieve impressive performance gains on multilingual
use-cases.
          </p>
          <p>
            The attention mechanism [
            <xref ref-type="bibr" rid="ref17">2, 24</xref>
            ] is one of the most widely used
methods in deep learning. It constructs a context vector by
weighting the entire input sequence, which improves over previous
sequence-to-sequence models [
            <xref ref-type="bibr" rid="ref6">13, 34, 35</xref>
            ]. As the model produces
weights associated with each word in a sentence, this allows for
evaluating interpretability by comparing the words that are given
priority in original versus translated tweets.
          </p>
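          <p>As a rough illustration of how an attention layer builds a context vector from a whole input sequence, the sketch below normalizes one score per word with a softmax and takes the weighted sum of the per-word hidden vectors. It is a minimal sketch under stated assumptions: the function names are hypothetical, the scores are passed in directly instead of being produced by learned parameters, and plain Python lists stand in for the tensors a real model would use.</p>

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, scores):
    """Weighted sum of per-word hidden vectors; returns (context, weights)."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return context, weights


# Two words with 2-dimensional hidden vectors and equal scores:
# the context vector is then simply the average of the two vectors.
context, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights)  # [0.5, 0.5]
print(context)  # [0.5, 0.5]
```

          <p>The returned weights are the per-word quantities that an interpretability analysis can inspect, since each weight states how much one word contributed to the context vector.</p>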
          <p>
            With more and more machine learning systems being adopted
by diverse application domains, transparency in decision-making
inevitably becomes an essential criterion, especially in high-risk
scenarios [
            <xref ref-type="bibr" rid="ref5">12</xref>
            ] where trust is of utmost importance. With deep
neural networks, including natural language systems, shown to
be easily fooled [
            <xref ref-type="bibr" rid="ref9">16</xref>
            ], there have been many promising ideas that
empower machine learning systems with the ability to explain
their predictions [
            <xref ref-type="bibr" rid="ref25">5, 32</xref>
            ]. Gilpin et al. [
            <xref ref-type="bibr" rid="ref4">11</xref>
            ] present a survey of
interpretability in machine learning, which provides a taxonomy of
research that addresses various aspects of this problem. Similar to
the work by Ross et al. [33], we employ an attention-based approach
to evaluate model interpretability applied to the crisis-domain.
          </p>
          <p>METHODOLOGY</p>
          <p>Problem Statement: Unsupervised Cross-Lingual Crisis Tweet Classification</p>
          <p>Consider tweets in language A and their corresponding translated
tweets in language B. The task of unsupervised cross-lingual
classification is to train a classifier using data only from the source
language and predict the labels for the data in the target language.</p>
          <p>This experimental setup is usually represented as $A \rightarrow B$ for
training a model using A and testing on B, or $B \rightarrow A$ for training a
model using B and testing on A. $X$ refers to the data and $Y$ refers
to the ground truth labels. The multilingual dataset used in our
experiments consists of original multilingual ($ml$) tweets and their
translated ($en$) tweets in English. To summarize:
Experiment ($en \rightarrow ml$):
Input: $X_{en}$, $Y_{en}$, $X_{ml}$
Output: $\hat{Y}_{ml} \leftarrow \mathrm{predict}(X_{ml})$
Experiment ($ml \rightarrow en$):
Input: $X_{ml}$, $Y_{ml}$, $X_{en}$
Output: $\hat{Y}_{en} \leftarrow \mathrm{predict}(X_{en})$
In the following sections, we propose two methodologies to
enhance cross-lingual classification: 1) Attention Realignment and 2)
Pseudo-Labelling. Attention realignment utilizes a language
classifier which is trained in parallel to realign the attention layer of
the task classifier such that the weights are more geared towards
task-specific words regardless of the language. Pseudo-Labelling
further enhances the classifier by adding high quality seeds from
the target language that are pseudo-labelled by the task classifier.</p>
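          <p>The pseudo-labelling step described above can be sketched as follows: the task classifier, trained on the source language only, scores unlabelled target-language tweets, and only confident predictions are kept as seeds and added to the training data. This is a minimal sketch under stated assumptions. The function names, the 0.9 confidence threshold and the stand-in scoring function are hypothetical; the text says only that high quality seeds are selected and does not state the selection rule.</p>

```python
def select_pseudo_labels(target_texts, predict_proba, threshold=0.9):
    """Keep target-language items that the classifier labels with high confidence.

    predict_proba(text) is assumed to return the probability of the
    positive ('request') class for one tweet.
    """
    seeds = []
    for text in target_texts:
        p_positive = predict_proba(text)
        if p_positive >= threshold:
            seeds.append((text, 1))
        elif 1.0 - p_positive >= threshold:
            seeds.append((text, 0))
        # Uncertain items are left out of the retraining set.
    return seeds


def augment(source_labelled, seeds):
    """Combine labelled source-language data with pseudo-labelled target data."""
    return list(source_labelled) + list(seeds)


# Stand-in for a trained classifier: fixed, made-up probabilities.
toy_scores = {"tweet_a": 0.97, "tweet_b": 0.55, "tweet_c": 0.02}
seeds = select_pseudo_labels(list(toy_scores), toy_scores.get)
training_set = augment([("source_tweet", 1)], seeds)
print(seeds)  # [('tweet_a', 1), ('tweet_c', 0)]
```

          <p>In the unsupervised setting no target-language labels are used at any point: the pseudo-labels come from the model itself, and the classifier would then be retrained on the combined set.</p>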
          <p>Attention Realignment by Parallel Language Classifier</p>
          <p>As depicted in Fig. 2, the model on the left is the task classifier and
the model on the right is a language classifier that is trained in
parallel. The purpose of this language classifier is to pick up aspects
that are missed by the XLM-R model. These could be tweet-specific,
crisis-specific, or other linguistic nuances that separate original
tweets from translated tweets. Note that, semantically, translated
words are expected to have similar XLM-R representations.</p>
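          <p>Notation legend: tweets translated to English (‘message’
column in the dataset); multilingual tweets (‘original’ column
in the dataset); attention layer; a component that uses task-specific
data, i.e., + and − ‘Request’ tweets; a component that uses
language-specific data, i.e., English and multilingual tweets;
activation from the BiLSTM layer; hyperparameters.</p>
          <p>Attention realignment is a mechanism we introduce to encourage
the task classifier to be more language independent. The main idea
is that the words that are given higher attention in a language
classifier should be less important in a task classifier. For example,
‘dlo’ in Haitian Creole and ‘water’ in English should have the same vector
representation in language-agnostic models, while the sentence
structure, grammar, and other nuances can vary. We enforce this
rule by constructing two operations:</p>
          <p>(1) Attention Difference: The attention weights of the language
classifier are subtracted from those of the task classifier, and the
result is clipped to [0, 1]:
<disp-formula><tex-math>\overrightarrow{\alpha}'_{T} = \mathrm{clip}\left(\overrightarrow{\alpha}_{T} - \lambda\, \overrightarrow{\alpha}_{L},\; 0,\; 1\right)</tex-math></disp-formula>
where <inline-formula><tex-math>\overrightarrow{\alpha}_{T}</tex-math></inline-formula> and <inline-formula><tex-math>\overrightarrow{\alpha}_{L}</tex-math></inline-formula> are the attention weights of the task and language
classifiers, and <inline-formula><tex-math>\lambda</tex-math></inline-formula> is a hyperparameter to tune the amount of
subtraction needed for the task classifier. Similarly, for the language
classifier,
<disp-formula><tex-math>\overrightarrow{\alpha}'_{L} = \mathrm{clip}\left(\overrightarrow{\alpha}_{L} - \lambda\, \overrightarrow{\alpha}_{T},\; 0,\; 1\right)</tex-math></disp-formula></p>
          <p>(2) Attention Loss: Along with attention difference, the model
can also be trained by adding a loss
term that penalizes the similarity between the attention
weights from the two classifiers. We use the Frobenius norm.</p>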
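          <p>A minimal sketch of the two realignment operations on plain Python lists, under stated assumptions: the function names, the toy attention weights, and the choice lam = 1.0 are hypothetical, and the similarity penalty is taken to be the squared dot product of the two weight vectors computed on the same tweet, which is one possible reading of the Frobenius-norm term rather than the authors' exact formulation.</p>
          <preformat>
```python
def clip(value, low=0.0, high=1.0):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, value))


def attention_difference(own_weights, other_weights, lam):
    """Subtract lam * other_weights from own_weights and clip to [0, 1]."""
    return [clip(a - lam * b) for a, b in zip(own_weights, other_weights)]


def attention_similarity_penalty(task_weights, language_weights):
    """Squared dot product of two attention vectors over the same tweet.

    Assumption: the product of two vectors is a scalar, so its squared
    Frobenius norm reduces to the square of the dot product.
    """
    dot = sum(a * b for a, b in zip(task_weights, language_weights))
    return dot * dot


# Hypothetical attention weights over a four-word tweet.
task_attention = [0.10, 0.60, 0.20, 0.10]
language_attention = [0.50, 0.10, 0.35, 0.05]

realigned_task = attention_difference(task_attention, language_attention, lam=1.0)
realigned_language = attention_difference(language_attention, task_attention, lam=1.0)
penalty = attention_similarity_penalty(task_attention, language_attention)
```
          </preformat>
          <p>With these toy weights, the first and third words, which the language classifier attends to most, drop to zero in the realigned task attention, while the second word keeps most of its weight.</p>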
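          <p>Each attention loss term is the squared Frobenius norm of the product
of the attention weights of the two classifiers.</p>
          <p>The total loss sums four terms: the binary cross entropy losses of the
task and language classifiers and the attention loss terms. One
hyperparameter tunes the attention loss weight, another tunes the joint
training loss, and the binary cross entropy loss is</p>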
          <p><disp-formula><tex-math>\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_{i} \log \hat{y}_{i} + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right]</tex-math></disp-formula>
It is important to note that the Frobenius norm is not taken simply
between the attention weights of the two models, but
between the attention weights produced by the two models
on the same input tweet. For example, for a given tweet, the
task classifier attends more to task-specific words and the
language classifier attends to language-specific words, so
the mechanism makes sure that the two sets of weights are distinct.</p>
          <p>3.4 Pseudo-Labelling</p>
          <p>To enhance the model further, we pseudo-label the data in the
target language. For example, if we are training a model using the
English tweets, we use the original tweets before translation for
pseudo-labelling. The idea is simply to gather high-quality seeds
from the target language to retrain the model. Note that we still do not use
any target labels here, so the approach remains unsupervised. Thus,
for retraining model M1 for en → ml, the new dataset would consist
of the positive English tweets and the multilingual tweets pseudo-labelled
as positive, which serve as positive examples, and the negative English
tweets and the multilingual tweets pseudo-labelled as negative, which
serve as negative examples.</p>
          <p>3.5 XLM-R Usage</p>
          <p>The recommended feature usage of XLM-R<sup>2</sup> is either fine-tuning
on the task or aggregating features from all 25 layers. We
employ the latter to extract the multilingual embeddings for the
tweets.</p>
          <p><sup>2</sup> https://github.com/facebookresearch/XLM</p>
          <p>4</p>
          <p>Deep Learning Library: Keras. Optimizer: Adam [learning rate = 0.005,
β1 = 0.9, β2 = 0.999, decay = 0.01]. Remaining values: 30, 100, 0.2, 10,
32, 1, 0.1, 0.01.</p>
          <p>Table 3: Implementation Details</p>
          <p>We use the open source dataset from Appen<sup>3</sup> consisting of
multilingual crisis response tweets. The dataset statistics for tweets with
‘request’ behavior labels are shown in Table 2. For all the experiments,
the dataset is balanced for each split.</p>
          <p>Each experiment is denoted as A → B, where A is the data
used to train the model and B is the data used to test
the model. For example, en → ml means we train the model using
English tweets and test on multilingual tweets.</p>
          <p>Models are implemented in Keras and the details are shown in
Table 3. The model hyperparameters are not exhaustively
tuned; we leave this exploration for future work.</p>
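          <table-wrap>
            <caption>
              <p>Table 4: Performance Comparison (Accuracy in %) for cross-lingual evaluation A → B, with same-language evaluation (A → A) in brackets.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Experiment</th><th>Baseline</th><th>Model M1</th><th>Model M2</th></tr>
              </thead>
              <tbody>
                <tr><td>en → ml (en → en)</td><td>59.98 (80.57)</td><td>62.53 (77.02)</td><td>66.79 (82.39)</td></tr>
                <tr><td>ml → en (ml → ml)</td><td>60.93 (70.07)</td><td>65.69 (63.50)</td><td>70.95 (73.84)</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Baseline = XLM-R + BiLSTM + Attention.</p>
          <p>Model M1 = Baseline + Attention Realignment.</p>
          <p>Model M2 = Model M1 + Pseudo-Labelling.</p>
          <p><sup>3</sup> https://appen.com/datasets/combined-disaster-response-data/</p>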
          <p>(2) Model M1: The attention weights for both task and language classifiers
are manipulated by each other during training by a process
of subtraction (attention difference) as well as a loss component
(attention loss). See Section 3.3.</p>
          <p>(3) Model M2: Adding the pseudo-labelling procedure to model
M1 produces model M2. Using Model M1, which is trained
to be language agnostic, tweets from the target languages
are pseudo-labelled. High-quality seeds are selected (those scored
above 0.7 by Model M1) and added to the original training
dataset to retrain the task classifier.</p>
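          <p>The seed selection step can be sketched as follows. This is a minimal illustration, assuming that the 0.7 threshold applies to the predicted class probability and is applied symmetrically to both classes; the function name select_seeds, the tweet identifiers, and the scores are hypothetical.</p>
          <preformat>
```python
def select_seeds(tweets, positive_probabilities, threshold=0.7):
    """Return (tweet, pseudo_label) pairs for confident predictions only.

    A tweet becomes a positive seed when its predicted probability of the
    positive class exceeds the threshold, and a negative seed when the
    probability of the negative class (1 - p) exceeds it. All other tweets
    are left out of the retraining set.
    """
    seeds = []
    for tweet, p in zip(tweets, positive_probabilities):
        if p > threshold:
            seeds.append((tweet, 1))
        elif (1.0 - p) > threshold:
            seeds.append((tweet, 0))
    return seeds


# Hypothetical labelled source data and unlabelled target tweets with
# positive-class probabilities from an already trained task classifier.
labelled_source = [("source_1", 1), ("source_2", 0)]
target_tweets = ["tweet_a", "tweet_b", "tweet_c", "tweet_d"]
target_scores = [0.91, 0.55, 0.12, 0.70]

seeds = select_seeds(target_tweets, target_scores)
augmented_training_set = labelled_source + seeds
```
          </preformat>
          <p>Only tweet_a and tweet_c pass the threshold here, so the retraining set grows by two pseudo-labelled examples while no ground truth target labels are used.</p>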
          <p>Results show that, for cross-lingual evaluation on en → ml,
model M1 outperforms the baseline by +4.3% and model M2
outperforms it by +11.4%. On ml → en, model M1 outperforms the baseline
by +7.8% and model M2 outperforms it by +16.5%. These gains are
relative to the baseline accuracy. This shows that
both models are effective in cross-lingual crisis tweet classification.</p>
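          <p>The reported gains can be checked against the accuracies in Table 4. The short sketch below assumes they are relative improvements over the baseline accuracy; the dictionary layout and the en->ml and ml->en keys are illustrative labels for the English to Multilingual and Multilingual to English experiments.</p>
          <preformat>
```python
# Cross-lingual accuracies (in %) copied from Table 4.
accuracy = {
    "en->ml": {"baseline": 59.98, "M1": 62.53, "M2": 66.79},
    "ml->en": {"baseline": 60.93, "M1": 65.69, "M2": 70.95},
}


def relative_gain(new, base):
    """Relative improvement of new over base, in percent."""
    return 100.0 * (new - base) / base


gains = {
    setup: {
        model: round(relative_gain(scores[model], scores["baseline"]), 1)
        for model in ("M1", "M2")
    }
    for setup, scores in accuracy.items()
}
```
          </preformat>
          <p>Computed from the rounded table entries, the gains come out as 4.3 and 11.4 for en->ml and 7.8 and 16.4 for ml->en; the last value differs by 0.1 from the reported +16.5%, which may stem from rounding of the table values.</p>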
          <p>
            An interesting observation is that using attention
realignment alone decreased the classification performance in the same
language, which is brought back up by pseudo-labelling. These
scores are shown in brackets in Table 4. A deeper investigation in
this direction on various other tasks can shed more light on the
impact of the realignment mechanism.</p>
          <p>We follow an attention architecture similar to that shown in [
            <xref ref-type="bibr" rid="ref11">18</xref>
            ]. The
context vector is constructed as the dot product between the
attention weights and the word activations. This is the
interpretable layer in our architecture. The attention weights represent
the importance of each word in the classification process. Two
examples are shown in Figure 3. In the first example, both en → en
and ml → ml give attention to the word ‘hungry’ (i.e., ‘grangou’ in
Haitian Creole). Note that these two results come from models
that are trained in the same language in which they are tested and are
thus expected to show ideal performance. For the baseline model in the
cross-lingual setup en → ml, although it correctly predicts the label, the
attention weights are more spread out. In model M2, with
attention realignment and pseudo-labelling, the attention weights, although
still somewhat spread, are shifted more toward ‘grangou’. Similarly,
in example 2, the attention weights in the baseline model are more
spread out. Cross-lingual performance of model M2 aligns more
with en → en and ml → ml. These examples show the importance
of having interpretability as a key criterion in cross-lingual crisis
tweet classification problems; the attention weights can also be used for
downstream tasks such as extracting relevant keywords for knowledge
graph construction.
          </p>
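          <p>A minimal sketch of how a context vector is formed from attention weights and word activations. It assumes the weights are obtained by a softmax over per-word scores, which is common for attention layers but is not stated in the text; the tokens, scores, and two-dimensional activations are hypothetical.</p>
          <preformat>
```python
import math


def softmax(scores):
    """Normalise raw scores into attention weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, activations):
    """Attention-weighted sum of the per-word activation vectors."""
    dim = len(activations[0])
    return [
        sum(w * h[d] for w, h in zip(weights, activations))
        for d in range(dim)
    ]


# Hypothetical three-word tweet with toy BiLSTM activations.
words = ["we", "are", "hungry"]
scores = [0.0, 0.0, 2.0]
activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = softmax(scores)
context = context_vector(weights, activations)
most_attended = words[weights.index(max(weights))]
```
          </preformat>
          <p>Reading off the largest weight gives the word the classifier relied on most, which is how the attention layer is inspected in the examples above.</p>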
          <p>CONCLUSION</p>
          <p>We presented a novel approach to the unsupervised cross-lingual
crisis tweet classification problem using a combination of an attention
realignment mechanism and a pseudo-labelling procedure (over
the state-of-the-art multilingual model XLM-R) to make the task
classifier more language agnostic. Performance evaluation
showed that models M1 and M2 outperformed the baseline by
+4.3% and +11.4%, respectively, for cross-lingual text classification
from English to Multilingual. We also presented an
interpretability analysis by comparing the attention layers of the models. It
shows the importance of incorporating a word-level language-agnostic
characteristic in the learning process when training data
is available in only one language. Performing extensive
hyperparameter tuning and expanding the idea to other tasks (including
cross-task/multi-task) are left as future work. Another
direction for future work is to incorporate human-engineered
knowledge from multilingual knowledge graphs such as
BabelNet into our model architecture, which could improve the learning
of similar concepts across languages that are critical to crisis response
agencies.</p>
          <p>Reproducibility: Source code is available at: https://github.com/jitinkrishnan/Cross-Lingual-Crisis-Tweet-Classification</p>
          <p>The authors would like to thank the U.S. National Science Foundation
for partially supporting this research through grants IIS-1815459 and
IIS-1657379.</p>
          <p>[1] Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with
adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151
(2018).</p>
          <p>[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural
machine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473 (2014).</p>
          <p>[3] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation
with structural correspondence learning. In Proceedings of the 2006 conference on
empirical methods in natural language processing. 120–128.</p>
          <p>[4] Carlos Castillo. 2016. Big crisis data: social media in disasters and time-critical
situations. Cambridge University Press.</p>
          <p>[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter
Abbeel. 2016. InfoGAN: Interpretable representation learning by information
maximizing generative adversarial nets. In Advances in neural information processing
systems. 2172–2180.</p>
          <p>[6] Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020.
Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup.
In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics: Student Research Workshop. 292–298.</p>
          <p>[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary,
Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer,
and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning
at scale. arXiv preprint arXiv:1911.02116 (2019).</p>
          <p>[33] Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for
the right reasons: Training differentiable models by constraining their
explanations. arXiv preprint arXiv:1703.03717 (2017).</p>
          <p>[34] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural
networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.</p>
          <p>[35] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning
with neural networks. In Advances in neural information processing systems.
3104–3112.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          , Guillaume Lample,
          <string-name>
            <surname>Marc'Aurelio Ranzato</surname>
            , Ludovic Denoyer, and
            <given-names>Hervé</given-names>
          </string-name>
          <string-name>
            <surname>Jégou</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Word Translation Without Parallel Data</article-title>
          .
          <source>arXiv preprint arXiv:1710.04087</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Yaroslav</given-names>
            <surname>Ganin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Victor</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Unsupervised domain adaptation by backpropagation</article-title>
          .
          <source>arXiv preprint arXiv:1409.7495</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Leilani H Gilpin</surname>
            , David Bau, Ben Z Yuan, Ayesha Bajwa,
            <given-names>Michael</given-names>
          </string-name>
          <string-name>
            <surname>Specter</surname>
            , and
            <given-names>Lalana</given-names>
          </string-name>
          <string-name>
            <surname>Kagal</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Explaining explanations: An overview of interpretability of machine learning</article-title>
          .
          <source>In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA)</source>
          . IEEE,
          <fpage>80</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>David</given-names>
            <surname>Gunning</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Explainable artificial intelligence (xai)</article-title>
          .
          <source>Defense Advanced Research Projects Agency (DARPA)</source>
          ,
          <source>nd Web 2</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9</source>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Muhammad</surname>
            <given-names>Imran</given-names>
          </string-name>
          , Carlos Castillo, Fernando Diaz, and
          <string-name>
            <given-names>Sarah</given-names>
            <surname>Vieweg</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Processing social media messages in mass emergency: A survey</article-title>
          .
          <source>ACM Computing Surveys (CSUR) 47</source>
          ,
          <issue>4</issue>
          (
          <year>2015</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Muhammad</surname>
            <given-names>Imran</given-names>
          </string-name>
          , Prasenjit Mitra, and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Castillo</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages</article-title>
          .
          <source>arXiv preprint arXiv:1605.05894</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Jia</surname>
          </string-name>
          and
          <string-name>
            <given-names>Percy</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Adversarial examples for evaluating reading comprehension systems</article-title>
          .
          <source>arXiv preprint arXiv:1707.07328</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Armand</surname>
            <given-names>Joulin</given-names>
          </string-name>
          , Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Loss in translation: Learning bilingual word mapping with a retrieval criterion</article-title>
          .
          <source>arXiv preprint arXiv:1804.07745</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Jitin</surname>
            <given-names>Krishnan</given-names>
          </string-name>
          , Hemant Purohit, and
          <string-name>
            <given-names>Huzefa</given-names>
            <surname>Rangwala</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Diversity-Based Generalization for Neural Unsupervised Text Classification under Domain Shift</article-title>
          . https://arxiv.org/pdf/2002.10937.pdf
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Jitin</surname>
            <given-names>Krishnan</given-names>
          </string-name>
          , Hemant Purohit, and
          <string-name>
            <given-names>Huzefa</given-names>
            <surname>Rangwala</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Social Web Data for Emergency Services</article-title>
          .
          <source>arXiv preprint arXiv:2003.04991</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lample</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Cross-lingual language model pretraining</article-title>
          .
          <source>arXiv preprint arXiv:1901.07291</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Kathy</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ankit</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alok</given-names>
            <surname>Choudhary</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Real-time disease surveillance using twitter data: demonstration on flu and cancer</article-title>
          .
          <source>In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          .
          <fpage>1474</fpage>
          -
          <lpage>1477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Hongmin</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Doina</given-names>
            <surname>Caragea</surname>
          </string-name>
          , Cornelia Caragea, and
          <string-name>
            <given-names>Nic</given-names>
            <surname>Herndon</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Disaster response aided by tweet classification with a domain adaptation approach</article-title>
          .
          <source>Journal of Contingencies and Crisis Management</source>
          <volume>26</volume>
          ,
          <issue>1</issue>
          (
          <year>2018</year>
          ),
          <fpage>16</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Zheng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ying Wei</surname>
          </string-name>
          , Yu Zhang, and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hierarchical attention transfer network for cross-domain sentiment classification</article-title>
          .
          <source>In Thirty-Second AAAI Conference on Artificial Intelligence .</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Minh-Thang</surname>
            <given-names>Luong</given-names>
          </string-name>
          , Hieu Pham, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1508.04025</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Guoqin</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Tweets Classification with BERT in the Field of Disaster Management</article-title>
          . https://pdfs.semanticscholar.org/d226/ 185fa1e14118d746cf0b04dc5be8f545ec24.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Reza</surname>
            <given-names>Mazloom</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Hongmin</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Doina</given-names>
            <surname>Caragea</surname>
          </string-name>
          , Cornelia Caragea, and
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Imran</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets</article-title>
          .
          <source>International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 11</source>
          ,
          <issue>2</issue>
          (
          <year>2019</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Edouard Grave, Piotr Bojanowski,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Puhrsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Armand</given-names>
            <surname>Joulin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Advances in Pre-Training Distributed Word Representations</article-title>
          .
          <source>In Proceedings of the International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Dat</given-names>
            <surname>Tien</surname>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , Kamela Ali Al Mannai,
          <string-name>
            <given-names>Shafiq</given-names>
            <surname>Joty</surname>
          </string-name>
          , Hassan Sajjad, Muhammad Imran, and
          <string-name>
            <given-names>Prasenjit</given-names>
            <surname>Mitra</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Rapid classification of crisis-related data on social networks using convolutional neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1608.03902</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Ferda</surname>
            <given-names>Ofli</given-names>
          </string-name>
          , Patrick Meier, Muhammad Imran, Carlos Castillo, Devis Tuia, Nicolas Rey, Julien Briant, Pauline Millet, Friedrich Reinhard,
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Parkan</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Combining human computing and machine learning to make sense of big (aerial) data for disaster response</article-title>
          .
          <source>Big data 4</source>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>47</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Bahman</given-names>
            <surname>Pedrood</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hemant</given-names>
            <surname>Purohit</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mining help intent on twitter during disasters via transfer learning with sparse coding</article-title>
          .
          <source>In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation</source>
          . Springer,
          <fpage>141</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Hemant</surname>
            <given-names>Purohit</given-names>
          </string-name>
          , Carlos Castillo, Fernando Diaz, Amit Sheth, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Meier</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Emergency-relief coordination on social media: Automatically matching resource requests and offers</article-title>
          .
          <source>First Monday</source>
          <volume>19</volume>
          ,
          <issue>1</issue>
          (Dec.
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Tulio</surname>
          </string-name>
          <string-name>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>" Why should i trust you?" Explaining the predictions of any classifier</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          .
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>