=Paper=
{{Paper
|id=Vol-2657/xproceedings
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-2657/xproceedings.pdf
|volume=Vol-2657
}}
==None==
Proceedings of the ACM SIGKDD Workshop on Knowledge-infused Mining and Learning for Social Impact (KiML 2020)

First International Workshop on Advancing Decision Making in Health, Crisis Response, and Finance

Editors: Manas Gaur, Alejandro Jaimes, Fatma Özcan, Srinivasan Parthasarathy, Sameena Shah, Amit Sheth, and Biplav Srivastava

August 24, 2020, San Diego, CA. Co-located with the 26th ACM Conference on Knowledge Discovery and Data Mining (KDD 2020). http://kiml2020.aiisc.ai/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee. These proceedings are not included in the ACM Digital Library. KiML'20, August 24, 2020, San Diego, California, USA. Copyright (c) 2020 held by the author(s). In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the ACM SIGKDD 2020 Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Organizers:
Manas Gaur (AI Institute, University of South Carolina)
Alejandro (Alex) Jaimes (Dataminr Inc., NYC)
Fatma Özcan (IBM Research Almaden)
Srinivasan Parthasarathy (Ohio State University)
Sameena Shah (JP Morgan, NYC)
Amit Sheth (AI Institute, University of South Carolina)
Biplav Srivastava (IBM Chief Analytics Office, NYC)

Program Committee:
Nitin Agarwal (University of Arkansas)
Amanuel Alambo (Kno.e.sis Center)
Shreyansh Bhatt (Amazon)
Vasilis Efthymiou (IBM Research)
Utkarshani Jaimini (AI Institute, University of South Carolina)
Ugur Kurşuncu (AI Institute, University of South Carolina)
Sarasi Lalithsena (IBM Watson)
Chuan Lei (IBM Research)
Quanzhi Li (Alibaba Group)
Xiaomo Liu (S&P Global Ratings)
Yong Liu (Outreach.io)
Raghava Mutharaju (IIIT Delhi)
Arindam Pal (Data61, CSIRO)
Sujan Perera (Amazon)
Hemant Purohit (George Mason University)
Kaushik Roy (AI Institute, University of South Carolina)
Valerie Shalin (Wright State University)
Kai Shu (Arizona State University)
Nikhita Vedula (Ohio State University)
Ruwan Wickramarachchi (AI Institute, University of South Carolina)
Ke Zhang (Dataminr Inc.)
Jinjin Zhao (Amazon)

Webmasters:
Vishal Pallagani (AI Institute, University of South Carolina)
Ibrahim Salman (AI Institute, University of South Carolina)

Preface

Research in artificial intelligence and data science is accelerating rapidly due to an unprecedented explosion in the amount of information on the web. In parallel, we have seen immense growth in the construction and use of knowledge networks at Google, Netflix, NSF, and NIH. However, current methods risk an unsatisfactory ceiling of applicability due to shortcomings in unifying knowledge graphs, data mining, and deep learning.
In this changing world, retrospective studies of state-of-the-art AI and data science systems have raised concerns about trust, traceability, and interactivity for prospective applications in healthcare, finance, and crisis response. We believe the paradigm of knowledge-infused mining and learning can account both for knowledge that accrues from domain expertise and for guidance from physical models. Further, it will allow the community to design new evaluation strategies that assess robustness and fairness across all comparable state-of-the-art algorithms.

The Workshop on Knowledge-infused Mining and Learning for Social Impact was centered on the following thematic components:
(a) Data Management: resource management and resource discovery across heterogeneous and inconsistent data resources.
(b) Data Usage: methods and systems for visualization, representation, reasoning, and interaction.
(c) Evaluation: bringing together researchers at the intersection of databases, the semantic web, information systems, and AI to create new approaches and tools that benefit a broad range of policymakers (e.g., mental health professionals, education practitioners, emergency responders, and economists).

The workshop brought together researchers and practitioners from both academia and industry who are interested in the creation and use of knowledge graphs for understanding online conversations in crisis response (e.g., COVID-19), public health (e.g., social network analysis for mental health insights), and finance (e.g., mining insights on the financial impact of COVID-19, such as recession and unemployment, using Twitter or organizational data). Additionally, we encouraged researchers and practitioners from the areas of human-centered computing, interaction and reasoning, statistical relational mining and learning, intelligent agent systems, semantic social network analysis, deep graph learning, and recommender systems.
The main program of KiML'20 consists of seven papers, selected out of thirteen submissions, covering topics related to knowledge-enabled feature elicitation, adversarial learning, crisis response, public health, and COVID-19. We sincerely thank the authors of the submissions as well as the attendees of the workshop. We wish to thank the members of our program committee for their help in selecting high-quality papers. Furthermore, we are grateful to Manuela Veloso, Sriraam Natarajan, Jose Ambite, and Pieter De Leenheer for giving keynote presentations on their recent work on Symbiotic Autonomy, Human Allied Probabilistic Learning, Biomedical Data Science, and Data Intelligence.

Manas Gaur, Alejandro Jaimes, Fatma Özcan, Srinivasan Parthasarathy, Sameena Shah, Amit Sheth, and Biplav Srivastava
August 2020

Table of Contents

Invited Talks
• Symbiotic Autonomy: Knowing When and What to Learn from Experience. Manuela M. Veloso
• Human Allied Probabilistic Learning. Sriraam Natarajan
• Data Intelligence in the 2020s. Pieter De Leenheer
• Semantics in Biomedical Data Science. Jose Luis Ambite

Research Papers
• Textual Evidence for the Perfunctoriness of Independent Medical Reviews. Adrian Brasoveanu, Megan Moodie and Rakshit Agrawal
• Knowledge Intensive Learning of Generative Adversarial Networks. Devendra Dhami, Mayukh Das and Sriraam Natarajan
• Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak. Amanuel Alambo, Manas Gaur and Krishnaprasad Thirunarayan
• Cost Aware Feature Elicitation. Srijita Das, Rishabh Iyer and Sriraam Natarajan
• A New Delay Differential Equation Model for COVID-19. B Shayak, Mohit Manoj Sharma and Manas Gaur
• Public Health Implications of a Delay Differential Equation Model for COVID-19. Mohit Manoj Sharma and B Shayak
• Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. Jitin Krishnan, Hemant Purohit and Huzefa Rangwala

Keynote Talk 1
Symbiotic Autonomy: Knowing When and What to Learn from Experience
Manuela M. Veloso
Head, JPMorgan AI Research
Herbert A. Simon University Professor, School of Computer Science, Carnegie Mellon University
manuela.veloso@jpmchase.com

Abstract: The talk will present work on novel human-AI interaction, in which humans and AI complement each other in their knowledge and learning. I will discuss examples in autonomous mobile service robots and in the financial domain. I will conclude with a brief discussion of multiple forms of available knowledge for AI systems that continuously learn from experience.

Bio: Manuela M. Veloso is the Head of J.P. Morgan AI Research, which pursues fundamental research in areas of core relevance to financial services, including data mining and cryptography, machine learning, explainability, and human-AI interaction. J.P. Morgan AI Research partners with applied data analytics teams across the firm as well as with leading academic institutions globally. Professor Veloso is on leave from Carnegie Mellon University as the Herbert A. Simon University Professor in the School of Computer Science, and the past Head of the Machine Learning Department.
With her students, she has led research in AI, with a focus on robotics and machine learning, having concretely researched and developed a variety of autonomous robots, including teams of soccer robots and mobile service robots. Her robot soccer teams have been RoboCup world champions several times, and the CoBot mobile robots have autonomously navigated for more than 1,000 km in university buildings. Professor Veloso is the Past President of AAAI (the Association for the Advancement of Artificial Intelligence), and the co-founder, Trustee, and Past President of RoboCup. Professor Veloso has been recognized with multiple honors, including being a Fellow of the ACM, IEEE, AAAS, and AAAI. She is the recipient of several best paper awards, the Einstein Chair of the Chinese Academy of Science, the ACM/SIGART Autonomous Agents Research Award, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Professor Veloso earned Bachelor and Master of Science degrees in Electrical and Computer Engineering from Instituto Superior Tecnico in Lisbon, Portugal, a Master of Arts in Computer Science from Boston University, and Master of Science and Ph.D. degrees in Computer Science from Carnegie Mellon University. See www.cs.cmu.edu/~mmv/Veloso.html for her scientific publications.

Keynote Talk 2
Human Allied Probabilistic Learning
Sriraam Natarajan
Director, Center for Machine Learning
Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas
sriraam.natarajan@utdallas.edu

Abstract: Historically, Artificial Intelligence has taken a symbolic route for representing and reasoning about objects at a higher level, or a statistical route for learning complex models from large data. To achieve true AI, it is necessary to make these different paths meet and enable seamless human interaction. First, I will briefly introduce learning from rich, structured, complex, and noisy data.
Next, I will present the recent progress that allows for more reasonable human interaction, where the human input is taken as "advice" and the learning algorithm combines this advice with data. The advice can be in the form of qualitative influences, preferences over labels/actions, privileged information obtained during training, or a simple precision-recall trade-off. Finally, I will outline our recent work on "closing the loop," where information is solicited from humans as needed, allowing for seamless interactions with the human expert. While I will discuss these methods primarily in the context of probabilistic and relational learning, I will also present our results on reinforcement learning and inverse reinforcement learning.

Bio: Dr. Sriraam Natarajan is an Associate Professor and the Director of the Center for ML in the Department of Computer Science at The University of Texas at Dallas. He was previously an Associate Professor and earlier an Assistant Professor at Indiana University and the Wake Forest School of Medicine, and a post-doctoral research associate at the University of Wisconsin-Madison; he graduated with his Ph.D. from Oregon State University. His research interests lie in the field of Artificial Intelligence, with emphasis on Machine Learning, Statistical Relational Learning and AI, Reinforcement Learning, Graphical Models, and Biomedical Applications. He has received the Young Investigator award from the US Army Research Office, an Amazon Faculty Research Award, an Intel Faculty Award, a XEROX Faculty Award, a Verisk Faculty Award, and the IU Trustees Teaching Award from Indiana University. He is the program co-chair of the SDM 2020 and ACM CoDS-COMAD 2020 conferences. He is the specialty chief editor of the Frontiers in ML and AI journal, an editorial board member of the MLJ, JAIR, and DAMI journals, and the electronic publishing editor of JAIR.
Keynote Talk 3
Data Intelligence in the Age of Accountability
Pieter De Leenheer
Senior Research Fellow, Harvard Business School
Co-Founder and Chief Science Officer, Collibra Inc.
pdeleenheer@hbs.edu

Abstract: Knowledge graphs, machine learning, and distributed ledgers are just a few of the emerging intelligent technologies that unlock new options to innovate business models, augment scientific knowledge and self-understanding, and enhance decision making. Because data is a critical driver for intelligent systems, machine calculation may supplant human decision making in many scenarios. The accessibility, quality, and currency of data are necessary criteria to ensure these systems produce viable innovation options that can be accounted for. But are these criteria sufficient?

Bio: Pieter is a senior research fellow at Harvard Business School and serves as adjunct faculty at Columbia University. He is a cofounder and former Chief Science Officer of Collibra, a unicorn venture in data intelligence, which spun off from his PhD research on community-based ontology management. Pieter writes, teaches, and advises on computing and management aspects of data innovation, accountability, and citizenship. He serves as an expert to the European Commission and several governments, and as a board member of several startups such as Gluetech.com and Yesse.tech. Prior to cofounding the company, Pieter was a professor at VU University of Amsterdam. He lives in New York City with his family.

Keynote Talk 4
Semantics in Biomedical Data Science
Jose Luis Ambite
Research Team Leader, Information Sciences Institute
Associate Research Professor, University of Southern California
ambite@isi.edu

Abstract: There is an explosion of biomedical data that promises to enable novel discoveries, treatments, and the ultimate goal of personalized medicine. These data are generated in a great variety of forms, ranging from sensor data, to imaging, to genetics, and all types of clinical data.
Moreover, the data are often scattered across organizations, and even the same data type is represented in diverse structures. Thus, the need to provide a semantically consistent view, so that the data can be meaningfully analyzed, is critical. I will describe core data integration and knowledge graph construction techniques, namely entity linkage and formal schema mappings, with illustrative biomedical data integration applications, highlighting some novel neural semantic similarity methods and some surprising applications of record linkage techniques, such as efficiently finding genetically related individuals. I will discuss architectures for large-scale data integration and analysis, including sensor data. Finally, I will discuss how we can analyze distributed datasets when the data cannot be shared for privacy or security reasons, and thus cannot be integrated. I will describe our recent work on Heterogeneous Federated Learning, which learns common neural models from siloed data.

Bio: Dr. Jose Luis Ambite is an Associate Research Professor in the Computer Science Department, and a Research Team Leader at the Information Sciences Institute, at the University of Southern California. His core expertise is in information integration, including query rewriting under constraints, learning schema mappings, and entity linkage. Dr. Ambite's research interests include databases, knowledge representation, the semantic web, semantic similarity, scientific workflows, and biomedical data science. He has published widely on these topics. He regularly serves as a reviewer for funding organizations, journals, and major conferences. In recent years, he has focused on developing novel approaches for the integration, analysis, and dissemination of biomedical and genetic data within several large NIH-funded projects, such as the PRISMS study, the NIMH Repository and Genetics Resource, SchizConnect, Population Architecture using Genomics and Epidemiology, and the Education Resource Discovery Index.
Textual Evidence for the Perfunctoriness of Independent Medical Reviews

Adrian Brasoveanu (abrsvn@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Megan Moodie (mmoodie@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Rakshit Agrawal (ragrawal@camio.com), Camio Inc., San Mateo, CA

ABSTRACT

We examine a database of 26,361 Independent Medical Reviews (IMRs) for privately insured patients, handled by the California Department of Managed Health Care (DMHC) through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance (either private insurance or the insurance that is part of their workers' comp; we focus on private insurance here). Laws requiring IMR were established in California and other states because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services. We analyze the text of the reviews and compare them closely with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10]. Despite the fact that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews, we can construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation, as well as low perplexity (11.86) and high categorical accuracy (0.53) on unseen test data, compared to the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy: 0.29 and 0.39). We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. We also examine four other corpora (drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]) to show that the IMR results are not typical for specialized-register corpora. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which points to the possibility that a crucial consumer protection mandated by law fails a sizeable class of highly vulnerable patients.

CCS CONCEPTS
• Computing methodologies → Latent Dirichlet allocation; Neural networks.

KEYWORDS
AI for social good, state-managed medical review processes, language models, topic models, sentiment classification

ACM Reference Format:
Adrian Brasoveanu, Megan Moodie, and Rakshit Agrawal. 2020. Textual Evidence for the Perfunctoriness of Independent Medical Reviews. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

1.1 Origin and structure of IMRs

Independent Medical Review (IMR) processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance – either private insurance or the insurance that is part of their workers' compensation. In this paper, we focus exclusively on privately insured patients. Laws requiring IMR processes were established in California and other states in the late 1990s because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services to maximize profit. [Footnote 1: For California, see the Friedman-Knowles Act of 1996, requiring California health plans to provide external independent medical review (IMR) for coverage denials. As of late 2002, 41 states and the District of Columbia had passed legislation creating an IMR process. In 34 of these states, including California, the decision resulting from the IMR is binding on the health plan. See [1, 15] for summaries of the political and legal history of the IMR system, and [2] for an early partial survey of the DMHC IMR data.]

As aptly summarized in [1], IMR is regularly used to settle disputes between patients and their health insurers over what is medically necessary or experimental/investigational care. Medical necessity disputes occur between health plans and patients because the health plan disagrees with the patient's doctor about the appropriate standard of care or course of treatment for a specific condition. Under the current system of managed care in the U.S., services rendered by a health care provider are reviewed to determine whether the services are medically necessary, a process referred to as utilization review (UR). UR is the oversight mechanism through which private insurers control costs by ensuring that only medically necessary care, covered under the contractual terms of a patient's insurance plan, is provided. Services that are not deemed medically necessary or fall outside a particular plan are not covered.

Procedures or treatment protocols are deemed experimental or investigational because the health plan – but not necessarily the patient's doctor, who in many cases has enough clinical confidence in a treatment to order it – considers them non-routine medical care, or takes them to be scientifically unproven to treat the specific condition, illness, or diagnosis for which their use is proposed.

It is important to realize that the IMR process is usually the third and final stage in the medical review process. The typical progression is as follows. After in-person and possibly repeated examination of the patient, the doctor recommends a treatment, which is then submitted for approval to the patient's health plan. If the treatment is denied in this first stage, both the doctor and the patient may file an appeal with the health plan, which triggers a second stage of reviews by the health-insurance provider, for which a patient can supply additional information and a doctor may engage in what is known as a "peer to peer" discussion with a health-insurance representative. If these second reviews uphold the initial denial, the only recourse the patient has is the state-regulated IMR process, and per California law, an IMR grievance form (and some additional information) is included with the denial letter.

An IMR review must be initiated by the patient and submitted to the California Department of Managed Health Care (DMHC), which manages IMRs for privately-insured patients. Motivated treating physicians may provide statements of support for inclusion in the documentation provided to DMHC by the patient, but in theory the IMR creates a new relationship of care between the reviewing physician(s) hired by a private contractor on behalf of DMHC, and the patient in question. The reviewing physicians' decision is supposed to be made based on what is in the best interest of the patient, not on cost concerns. It is this relation of care that constitutes the consumer protection for which IMR processes were legislated.

Understandably, given that the patients in question may be ill or disabled or simply discouraged by several layers of cumbersome bureaucratic processes, there is a very high attrition from the initial review to the final, IMR, stage. That is, only the few highly motivated and knowledgeable patients – or the extremely desperate – get as far as the IMR process.

The IMR process is regulated by the state, but it is actually conducted by a third party. At this time (2019), the provider in California and several other states across the US is MAXIMUS Federal Services, Inc. [Footnote 2: https://www.maximus.com/capability/appeals-imr] The costs associated with the IMR review, at least in California, are covered by health insurers. It is DMHC's and MAXIMUS's responsibility to collect all the documentation from the patient, the patient's doctor(s) and the health insurer. There are no independent checks that all the documentation has actually been collected, however, and patients do not see a final list of what has been provided to the reviewer prior to the IMR decision itself (a post facto list of file contents is mailed to patients along with the final, binding, decision; it is unclear what recourse a patient may have if they find pertinent information was missing from the review file).

Once the documentation is assembled, MAXIMUS forwards it to anywhere from one to three reviewers, who remain anonymous, but are certified by MAXIMUS to be appropriately credentialed and knowledgeable about the treatment(s) and condition(s) under review. The reviewer submits a summary of the case, and also a rationale and evidence in support of their decision, which is a binary Upheld/Overturned decision about the medical service. IMR reviewers do not enter a consultative relationship with the patient, doctor or health plan – they must render an uphold/overturn decision based solely on the provided medical records. However, as noted above, they are in an implied relationship of care to the patient, a point to which we return in the Discussion section below (§4).

While insurance carriers do not provide statistics about the percentage of requested treatments that are denied in the initial stage, looking at the process as a whole, a pattern of service denial aimed to maximize profit, rather than simply maintain cost effectiveness, seems to emerge. Typically, the argument for denial contends that the evidence for the beneficial effects of the treatment fails the prevailing standard of scientific evidence. This prevailing standard invoked by IMR reviewers is usually randomized control trials (RCTs), which are expensive, time-consuming trials that are run by large pharmaceutical companies only if the treatment is ultimately estimated to be profitable.

RCTs, however, have known limits: they "require minimal assumptions and can operate with little prior knowledge [which] is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded." [3] Inflexibly applying the RCT "gold standard" in the IMR process is often a way to ignore the doctors' knowledge and experience in a way that seems superficially well-reasoned and scientific. "RCTs can play a role in building scientific knowledge and useful predictions" – and we add, treatment recommendations – "only [. . . ] as part of a cumulative program, [in combination] with other methods." [3]

Notably, the experimental/investigational category of treatments that get denied often includes promising treatments that have not been fully tested in clinical RCTs – because the treatment is new or the condition is rare in the population, so treatment development costs might not ultimately be recovered. Another common category of experimental/investigational denials involves "off-label" drug uses, that is, uses of FDA-approved pharmaceuticals for a purpose other than the narrow one for which the drug was approved.

1.2 Main argument and predictions

Recall that these 'experimental' treatments or off-label uses are recommended by the patient's doctor, and therefore their potential benefits are taken to outweigh their possible negative effects. The recommending doctor is likely very familiar with the often lengthy, tortuous and highly specific medical history of the patient, and with the list of 'less experimental' treatments that have been proven unsuccessful or have been removed from consideration for patient-specific reasons. It is also important to remember that many rare conditions have no "on-label" treatment options available, since expensive RCTs and treatment approval processes are not undertaken if companies do not expect to recover their costs, which is likely if the potential 'market' is small (few people have the rare condition).

Therefore, our main line of argumentation is as follows.

• Since IMRs are the final stage in a long bureaucratic process in which health insurance companies keep denying coverage for a treatment repeatedly recommended by a doctor as medically necessary, we expect that the issue of medical necessity is non-trivial when that specific patient and that specific treatment are carefully considered.
• We should therefore expect the text of the IMRs, which justifies the final determination, to be highly individualized and argue for that final decision (whether congruent with the health plan's decision or not) in a way that involves the particulars of the treatment and the particulars of the patient's medical history and conditions.

Thus, we expect a reasoned, thoughtful IMR to not be highly generic and templatic / predictable in nature. For instance, legal documents may be highly templatic as they discuss the application of the same law or policy across many different cases, but a response carefully considering the specifics of a medical case reaching the IMR stage is not likely to be similar to many other cases. We only expect high similarity and 'templaticity' for IMR reviews if they are reduced to a more or less automatic application of some prespecified set of rules (rubber-stamping).

1.3 Main results, and their limits

Concomitantly with this quantitative study, we conducted preliminary qualitative research with a focus on pain management and chronic conditions. We investigated the history of the IMR process, in addition to having direct experience with it. We had detailed conversations with doctors in Northern California and on private social media groups formed around chronic conditions and pain management. This preliminary research reliably points towards the possibility that IMR reviews are perfunctory, and that this crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. In this paper, we focus on the text of the IMR decisions and attempt to quantify the evidence for the perfunctoriness of the IMR process that they provide.

The text of the IMR findings does not provide unambiguous evidence about the quality and appropriateness of the IMR process. If we had access to the full, anonymized patient files submitted to the IMR reviewers (in addition to the final IMR decision and the associated text), we might have been able to provide much stronger evidence that IMRs should have a significantly higher percentage of overturns, and that the IMR process should be improved in various ways, e.g., (i) patients should be able to check that all the relevant documentation has been collected and will be reviewed, and (ii) the anonymous reviewers should be held to higher standards of doctor-patient care. At the very least, one would want to compare the reports/letters produced by the patient's doctor(s) and the IMR texts. However, such information is not available and there are no visible signs suggesting potential availability in the near future. The information that is made available by DMHC constitutes the IMR decision – whether to uphold or overturn the health plan decision – the anonymized decision letter, and information about the requested treatment category (also available in the letter). We, therefore, had to limit ourselves to the text of the DMHC-provided IMR findings in our empirical analysis.

A qualitative inspection of the corpus of IMR decisions made available by the California DMHC site as of June 2019 (a total of 26,631 cases spanning the years 2001-2019) indicates that the reviews – as documented in the text of the findings – focus more on the review procedure and associated legalese than on the actual medical history of the patient and the details of the case.

The goal in this paper is to investigate to what extent Natural Language Processing (NLP) / Machine Learning (ML) methods that are able to extract insights from large corpora point in the same direction, thus mitigating cherry-picking biases that are sometimes associated with qualitative investigations. In addition to the IMR text, we perform a comparative study with additional English-language datasets in an attempt to eliminate data-specific and problem-specific biases.

• We analyze the text of the IMR reviews and compare them with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10].
• As the size of data has significant consequences for language-model training, and NLP/ML models more generally, we expect models trained on the Yelp and IMDB corpora to outperform models trained on the IMR corpus, given that the IMDB corpus is twice as large as the IMR corpus, and the Yelp samples contain almost twice as many reviews.
• In this paper, we instead demonstrate that we were able to construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation.
• In addition, the model achieves a much lower perplexity (11.86) and a higher categorical accuracy (0.53) on unseen test data, compared to models trained on the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; categorical accuracy: 0.29 and 0.39).
• We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews. In an attempt to mitigate confirmation bias, as well as potentially significant register differences between IMRs and movie or restaurant reviews, we examine four additional corpora: drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]. These specialized-register corpora are potentially more similar to IMRs than IMDB or Yelp: the texts are more likely to be highly similar, include boilerplate text and have a templatic/standardized structure. We find that predictability of IMR texts, as measured by language-model perplexity and categorical accuracy, is higher than all the comparison datasets by a good margin.

Based on these empirical comparisons, we conclude that we have strong evidence that the IMR reviews are perfunctory and, therefore, that a crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. The paper is structured as follows. In Section 2, we discuss the datasets
For in detail, with a focus on the nature and characteristics of the IMR example, decisions for chronic pain management seem to mostly data. In Section 3, we discuss the models we use to analyze the IMR, rubber-stamp the Medical Treatment Utilization Schedule (MTUS) Yelp and IMDB datasets, as well as the four auxiliary corpora (drug guidelines, with very little consideration of the rarity of the un- reviews, data science jobs, legals cases and recipes). The section also derlying condition(s) (see our comments about RCTs above), or compares and discusses the results of these models. Section 4 puts a thoughtful evaluation of the risk/benefit profile of the denied all the results together into an argument for the perfunctoriness of treatment relative to the specific medical history of the patient the IMRs. Section 5 concludes the paper and outlines directions for (assuming this history was adequately documented to begin with). future work. KiML’20, August 24, 2020, San Diego, California, USA, Brasoveanu, Moodie and Agrawal 2 THE DATASETS Table 2: Outcome counts and percentages by year 2.1 The IMR dataset ReportYear Total # of cases Overturned Upheld The IMR dataset was obtained from the DMHC website in June 2001 28 7 (25%) 21 20193 and was minimally preprocessed. It contains 26,361 cases / 2002 695 243 (35%) 452 observations and 14 variables, 4 of which are the most relevant: 2003 738 280 (38%) 458 2004 788 305 (39%) 483 • TreatmentCategory: the main treatment category; 2005 959 313 (33%) 646 • ReportYear: year the case was reported; 2006 1080 442 (41%) 638 • Determination: indicates if the determination was upheld or 2007 1342 571 (43%) 771 overturned; 2008 1521 678 (45%) 843 • Findings: a summary of the case findings. 2009 1432 641 (45%) 791 The top 14 treatment categories (with percentages of total ≥ 2%), 2010 1453 661 (45%) 792 together with their raw counts and percentages are provided in 2011 1435 684 (48%) 751 2012 1203 589 (49%) 614 Table 1. 
2013 1197 487 (41%) 710 2014 1433 549 (38%) 884 Table 1: Top 14 treatment categories 2015 2079 1070 (51%) 1009 2016 3055 1714 (56%) 1341 TreatmentCategory Case count % of total 2017 2953 1391 (47%) 1562 Pharmacy 6480 25% 2018 2545 1218 (48%) 1327 Diag Imag & Screen 4187 16% 2019 425 209 (49%) 216 Mental Health 2599 10% DME 1714 7% Gen Surg Proc 1227 5% Orthopedic Proc 1173 5% Rehab/ Svc - Outpt 1157 4% Cancer Care 1029 4% Elect/Therm/Radfreq 828 3% Reconstr/Plast Proc 825 3% Autism Related Tx 767 3% Emergency/Urg Care 582 2% Diag/ MD Eval 573 2% Pain Management 527 2% Figure 1: % Overturned claimed on DMHC site (June 2019) The breakdown of cases by patient gender (not recorded for all 2.2 The comparison datasets cases) is as follows: Female – 14823 (56%), Male – 10836 (41%), Other As comparison datasets, we use the IMDB movie-review dataset [10], – 11 (0.0004%). which has 50,000 reviews and a binary positive/negative sentiment The breakdown by determination (the outcome of the IMR) is: classification associated with each review. This dataset will be par- Upheld – 14309 (54%), Overturned – 12052 (46%). ticularly useful as a baseline for our ULMFiT transfer-learning The outcome counts and percentages by year are provided in language models (and subsequent transfer-learning classification Table 2. The number of cases for 2019 include only the first 5 months models), where we show that we obtain results for the IMDB dataset of the year plus a subset of June 2019. that are similar to the ones in the original ULMFiT paper [8]. Interestingly, the DMHC website featured a graphic in June 2019 There are 50,000 movie reviews in the IMDB dataset, evenly split (Figure 1) that reports the percentage of Overturned outcomes to be into negative and positive reviews. The histogram of text lengths 64%, a figure that does not accord with any of our data summaries. for IMDB reviews is provided in Figure 2. 
The reviews contain a We intend to follow up on this issue and see if the DMHC can share total of 11,557,297 words. The mean length of a review is 231.15 their data-analysis pipeline so that we can pinpoint the source(s) words, with an SD of 171.32. of this difference. We select a sample of 50,000 Yelp (mainly restaurant) reviews [19], Given that our main goal here is to investigate the text of the with associated binarized negative/positive evaluations, to provide IMR findings and its predictiveness with respect to IMR outcomes, a comparison corpus intermediate between our DMHC dataset and we provide some general properties of this corpus. The histogram the IMDB dataset. From a total of 560,000 reviews (evenly split be- of word counts for the IMR findings (the text associated with each tween negative and positive), we draw a weighted random sample case) is provided in Figure 2. There are 26,361 texts, with a total of with the weights provided by the histogram of text lengths for the 5,584,280 words. Words are identified by splitting texts on white IMR corpus. The resulting sample contains 25,809 (52%) negative space (sufficient for our purposes here). The mean length of a text reviews and 24,191 (48%) positive reviews. The histogram of text is 211.84 words, with a standard deviation (SD) of 120.58. lengths for Yelp reviews is also provided in Figure 2. The reviews 3 https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinations- contain a total of 7,038,467 words. The mean length of a review is trend. 140.77 words, with an SD of 71.09. 
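The length-weighted Yelp sampling described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name, the bin width, and the use of the Efraimidis-Spirakis key trick for weighted sampling without replacement are all our assumptions.

```python
import random
from collections import Counter

def length_weighted_sample(pool_texts, target_texts, k, bin_width=50, seed=0):
    """Sample k texts from pool_texts so that their length distribution
    approximates that of target_texts (cf. the IMR-weighted Yelp sample).
    Weights come from the target corpus's binned length histogram; the
    Efraimidis-Spirakis key u**(1/w) gives weighted sampling w/o replacement."""
    rng = random.Random(seed)
    nbin = lambda t: len(t.split()) // bin_width  # whitespace word count, binned
    hist = Counter(nbin(t) for t in target_texts)
    total = sum(hist.values())
    keyed = []
    for t in pool_texts:
        w = hist.get(nbin(t), 0) / total  # density of this length bin in target
        if w > 0:                         # zero-weight texts can never be drawn
            keyed.append((rng.random() ** (1.0 / w), t))
    keyed.sort(reverse=True)              # largest keys win
    return [t for _, t in keyed[:k]]
```

With a target corpus of short texts, candidates whose length bin never occurs in the target are excluded outright, so the sample's length profile tracks the target's histogram.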
Figure 2: Histograms of text lengths (numbers of words per text) for the IMR, IMDB and Yelp corpora. (a) IMR; (b) IMDB; (c) Yelp.

Figure 3: Histograms of text lengths (numbers of words per text) for the auxiliary datasets. (a) Drug Reviews; (b) DS Jobs; (c) Legal cases; (d) Recipes.

2.3 Four auxiliary datasets

We also analyze four other specialized-register corpora: drug reviews [6], data science (DS) job postings [9], legal case reports [5] and cooking recipes [11]. The modeling results for these specialized-register corpora will enable us to better contextualize and evaluate the modeling results for the IMR, IMDB and Yelp corpora, since these four auxiliary datasets might be seen as more similar to the IMR corpus than movie or restaurant reviews. The drug-review corpus contains reviews of pharmaceutical products, which are closer in subject matter to IMRs than movie/restaurant reviews. The other three corpora are all highly specialized in register, just like the IMRs, with two of them (DS jobs and legal cases) particularly similar to the IMRs in that they involve templatic texts containing information aimed at a specific professional sub-community.

These four corpora are very different from each other, and from the IMR corpus, in terms of (i) the number of texts they contain and (ii) the average text length (number of words per text). Because of this, there was no obvious way to sample from them, and from the IMR, IMDB and Yelp corpora, such that the resulting samples were both roughly comparable with respect to the total number of texts and average text length, and also large enough to obtain reliable model estimates. We therefore analyzed these four corpora as a whole.

The drug-review corpus includes 132,300 drug reviews – more than double the number of texts in the IMDB and Yelp datasets, and more than 4 times the number of texts in the IMR dataset. From the original corpus of 215,063 reviews, we only retained the reviews associated with a rating of 10, which we label as positive, and those with a rating of 1 through 5, which we label as negative.[4] The histogram of text lengths for drug reviews is provided in Figure 3. The reviews contain a total of 11,015,248 words, with a mean length of 83.26 words per review (significantly shorter than the IMR/IMDB/Yelp texts) and an SD of 45.73.

The DS corpus includes 6,953 job postings (about a quarter of the number of texts in the IMR corpus), with a total of 3,731,051 words. The histogram of text lengths is provided in Figure 3. The mean length of a job posting is 536.61 words (more than twice as long as the IMR/IMDB/Yelp texts), with an SD of 254.06.

There are 3,890 legal-case reports (even fewer than DS job postings), with a total of 25,954,650 words (about 5 times larger than the IMR corpus). The histogram of text lengths for the legal-case reports is provided in Figure 3. The mean length of a report is 6,672.15 words (an order of magnitude longer than IMR/IMDB/Yelp), with a very high SD of 11,997.98.

Finally, the recipe corpus includes more than 1 million texts: there are 1,029,719 recipes, with a total of 117,563,275 words (very large compared to our other corpora). The histogram of text lengths for the recipes is provided in Figure 3. The mean length of a recipe is 114.17 words (close to the length of a drug review, and roughly half of an IMR), with an SD of 90.54.

[4] We did this so that we have a fairly balanced dataset (68,005 positive drug reviews and 64,295 negative reviews) with which to estimate classification models like the ones we report for the IMR, IMDB and Yelp corpora in the next section. For completeness, the drug-review classification results on previously unseen test data are as follows: logistic regression accuracy: 77.89%; accuracy of a multilayer perceptron with a 1,000-unit hidden layer and a ReLU non-linearity: 83.18%; ULMFiT classification model accuracy: 96.12%.

3 THE MODELS

In this section, we analyze the text of the IMR findings and its predictiveness with respect to IMR outcomes, and we systematically compare these results with the corresponding ones for the IMDB and Yelp corpora. The datasets were split into training (80%), validation (10%) and test (10%) sets. Test sets were only used for the final model evaluation.
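The 80/10/10 split described above can be sketched as follows; this is a generic illustration under our own naming, not the authors' actual data pipeline.

```python
import random

def train_valid_test_split(items, seed=42, train=0.8, valid=0.1):
    """Shuffle once with a fixed seed, then cut into 80/10/10
    train/validation/test slices; the test slice is set aside
    and used only for the final model evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```

Fixing the seed makes the split reproducible across model families, so the baseline classifiers, topic models, and language models below can all be evaluated on the same held-out texts.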
We start with baseline classification models (logistic regressions and logistic multilayer perceptrons with one hidden layer) to establish that the reviews in all three datasets under consideration are highly predictive of the associated binary outcomes. Once the predictiveness, hence relevance, of the text is established, we turn to an in-depth analysis of the texts themselves by means of topic and language models. We see that the text of the IMR reviews is significantly different (more predictable, less diverse / contentful) when compared to movie and restaurant reviews. We then turn to a final set of classification models that leverage transfer learning from the language models to see how predictive the texts can really be with respect to the associated binary outcomes. Finally, we report the results of estimating language models for the 4 auxiliary datasets introduced in the previous section.

The main conclusion of this extensive series of models is that the IMR corpus is an outlier, and it would be easy to make the IMR process fully automatic: it is pretty straightforward to train models that generate high-quality, realistic IMR reviews and generate binary decisions that are very reliably associated with these reviews. In contrast, movie and restaurant reviews produced by unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more human-like depth, sophistication and attention to detail, so current NLP models do not perform as well on them.

3.1 Classification models

We regress outcomes (Upheld/Overturned for IMR, negative/positive sentiment for IMDB/Yelp) against the text of the corresponding findings / reviews. For the purposes of these basic classification models, as well as the topic models discussed in the following subsection, the texts were preprocessed as follows. First, we removed stop words; for the IMR dataset, we also removed the following high-frequency words: patient, treatment, reviewer, request, medical and medically; for the IMDB dataset, we also removed the words film and movie. After part-of-speech tagging, we retained only nouns, adjectives, verbs and adverbs, since lexical meanings provide the most useful information for logistic (more generally, feed-forward) models and for topic models. The resulting dictionary for the IMR dataset had 23,188 unique words. We ensured that the dictionaries for the IMDB and Yelp datasets were also between 23,000 and 24,000 words by eliminating infrequent words. Bounding the dictionaries for each dataset to a similar range helps mitigate dataset-specific modeling biases: differently-sized vocabularies lead to differently-sized parameter spaces for the models. We extracted features by converting each text into sparse bag-of-words vectors of dictionary length, which record how many times each token occurred in the text.

These feature representations were the input to all the classifier models we consider in this subsection. The multilayer perceptron model had a single hidden layer with 1,000 units and a ReLU non-linearity. The classification accuracies on the test data for all three datasets are provided in Table 3.

Table 3: Classification accuracy for basic models

Model | IMR | IMDB | Yelp
logistic regression | 90.75% | 86.30% | 87.62%
multilayer perceptron | 90.94% | 87.14% | 88.92%

We see that the text of the findings / reviews is highly predictive of the associated binary outcomes, with the highest accuracy for the IMR dataset, despite the fact that it contains half the observations of the other two datasets. We can therefore turn to a more in-depth analysis of the texts to understand what kind of textual justification is used to motivate the IMR binary decisions. To that end, we examine and compare the results of two unsupervised/self-supervised types of models: topic models and language models.

3.2 Topic models

Topic modeling [17] is an unsupervised method that distills semantic properties of words and documents in a corpus in terms of probabilistic topics. The most widespread measure for topic-model evaluation is the coherence score [14]. Typically, as we increase the number of topics from very few, say 4, to more of them, we see an increase in coherence score that tends to level out after a certain number of topics. When modeling the IMDB and Yelp datasets, we see exactly this behavior, as shown in Figure 4.

Figure 4: Coherence scores for topic models (x-axis: number of topics; y-axis: coherence score). (a) IMR; (b) IMDB; (c) Yelp.

In contrast, the 4-topic model has the highest coherence score (0.56) for the IMR dataset, also shown in Figure 4, and the coherence score drops as we add more topics. As the word clouds for the 4-topic model in Figure 5 show, these 4 topics mostly reflect the legalese associated with the IMR review procedure and very little, if anything, of the treatments and conditions that were the main point of the review. In contrast, the corresponding high-scoring topic models for the IMDB and Yelp datasets reflect actual features of movies (e.g., family-life movies, westerns, musicals) or of venues (breakfast/lunch places, restaurants, shops, bars, hotels etc.).

Recall that IMRs are the legally-mandated last resort for patients seeking treatments (usually) ordered by their doctors, which their health plan refuses to cover. The reviews are conducted exclusively based on documentation. Putting aside the fact that it is unclear how much effort is taken to ensure that the documentation is complete, especially for patients with extensive and complicated health records, we see that relatively little specific information about a patient's medical history, condition(s), or the recommended treatments is reflected in the text of these decisions. The text seems to consist largely of legalese about the IMR process, the health plan / providers, basic demographic information about the patient, and generalities about the medical service or therapy requested for the enrollee's condition.

3.3 Language models with transfer learning

Language models, specifically those using neural networks, are usually recurrent-network or transformer-based architectures designed to learn textual distributional patterns in an unsupervised or self-supervised manner. Recurrent-network models – on which we focus here – commonly use Long Short-Term Memory (LSTM) [7] "cells," which are able to learn long-term dependencies in sequences. Representing text as a sequence of words, language models build rich representations of the words, sentences, and their relations within a certain language. We estimate a language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8]. Just as in [8], we use the AWD-LSTM model [12], a vanilla LSTM with 4 kinds of dropout regularization, an embedding size of 400, 3 LSTM layers (1,150 units per layer), and a BPTT of size 70.
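The fixed-length BPTT windows just mentioned can be illustrated with a minimal sketch. This is our own illustration of truncated backpropagation through time batching, not the fastai/AWD-LSTM implementation; the function name and stride handling are assumptions.

```python
def bptt_windows(token_ids, bptt=70):
    """Cut a token-id stream into (input, target) windows of length `bptt`
    for truncated backpropagation through time: the target sequence is the
    input sequence shifted by one position (next-word prediction)."""
    windows = []
    # each window needs one extra trailing token to form the shifted target
    for start in range(0, len(token_ids) - 1, bptt):
        chunk = token_ids[start:start + bptt + 1]
        if len(chunk) < 2:  # not enough tokens left for an (input, target) pair
            break
        windows.append((chunk[:-1], chunk[1:]))
    return windows
```

For example, with `bptt=3`, the stream `[0, 1, 2, 3, ...]` yields the pair `([0, 1, 2], [1, 2, 3])` first: every position's training target is simply the next token.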
Figure 5: Word clouds for the 4-topic IMR model

The AWD-LSTM model is pretrained on Wikitext-103 [13], consisting of 28,595 preprocessed Wikipedia articles with a total of 103 million words. This pretrained model is fairly simple (no attention, skip connections etc.), and the pretraining corpus is of modest size. To obtain our final language models for the IMR, IMDB and Yelp corpora, we fine-tune the pretrained AWD-LSTM model using discriminative [18] and slanted triangular [8, 16] learning rates. We do the same kind of minimal text preprocessing as in [8].

The IMR language model can generate high-quality and largely coherent text, unlike the IMDB / Yelp models. Two samples of generated text are provided below (each sample begins with its 'seed' text, boldfaced in the original).

• The issue in this case is whether the requested partial hospitalization program ( PHP ) services are medically necessary for treatment of the patient 's behavioral health condition . The American Psychiatric Association ( APA ) treatment guidelines for patients with eating disorders also consider PHP acute care to be the most appropriate setting for treatment , and suggest that patients should be treated in the least restrictive setting which is likely to be safe and effective . The PHP was initially recommended for patients who were based on their own medical needs , but who were
• The patient was admitted to a skilled nursing facility ( SNF ) on 12 / 10 / 04 . The submitted documentation states the patient was discharged from the hospital on 12 / 22 / 04 . The following day the patient 's vital signs were stable . The patient had been ambulating to the community with assistance with transfers , but has not had any recent medical or rehabilitation therapy . The patient had no new medical problems and was discharged in stable condition . The patient has requested reimbursement for the inpatient acute rehabilitation services provided

We see that the IMR language model is highly performant, despite the simple model architecture we used, the modest size of the pretraining corpus, and the small size of the IMR corpus. The quality of the generated text is also very high, particularly given all these limitations.
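The slanted triangular learning-rate schedule used in the fine-tuning above can be sketched directly from the formula in the ULMFiT paper [8]: a short linear warm-up over the first `cut_frac` of the training iterations, then a long linear decay. The code below is our own sketch of that published formula (with the paper's default hyperparameters), not the fastai implementation.

```python
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at iteration t of T (ULMFiT [8]):
    linear increase to lr_max over the first cut_frac * T iterations,
    then linear decay back down to lr_max / ratio."""
    cut = int(T * cut_frac)           # iteration at which the peak is reached
    if t < cut:
        p = t / cut                   # warm-up phase: fraction of the way up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

With `T=100`, the schedule starts at `lr_max / ratio`, peaks at `lr_max` at iteration 10, and decays back to `lr_max / ratio` by iteration 100 — the asymmetric triangle that gives the schedule its name.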
The perplexity and categorical accuracy for the 3 language models are provided in Table 4. The perplexity for the IMR findings is much lower than for the IMDB / Yelp reviews, and the language model can correctly guess the next word more than half the time.

Table 4: Language-model perplexity and categorical accuracy

Metric | IMR | IMDB | Yelp
perplexity | 11.86 | 36.96 | 40.3
categorical accuracy | 53% | 39% | 29%

3.4 Classification with transfer learning

We further fine-tune the language models discussed in the previous subsection to train classifiers for the three datasets. Following [4, 8], we gradually unfreeze the classifier models to avoid catastrophic forgetting. The results of evaluating the classifiers on the withheld test sets are provided in Table 5. Despite the fact that the IMR dataset contains half of the classification observations of the other two datasets, we obtain the highest level of accuracy when predicting binary Upheld/Overturned decisions based on the text of the IMR findings.

Table 5: Accuracy for transfer-learning classifiers

Metric | IMR | IMDB | Yelp
classification accuracy | 97.12% | 94.18% | 96.16%
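The two language-model metrics reported in Table 4 have simple definitions: perplexity is the exponential of the mean negative log-likelihood of the true next words, and categorical accuracy is the fraction of positions where the model's top-scoring word is the true next word. The sketch below illustrates both under our own naming; it is an evaluation-metric illustration, not the authors' code.

```python
import math

def perplexity_and_accuracy(predictions, targets):
    """predictions: one probability distribution over the vocabulary per
    position; targets: the true next-word ids.  Returns
    (exp(mean negative log-likelihood), fraction of correct argmax guesses)."""
    nll, hits = 0.0, 0
    for dist, y in zip(predictions, targets):
        nll += -math.log(dist[y])                                  # NLL of the true word
        hits += int(max(range(len(dist)), key=dist.__getitem__) == y)  # top-1 hit?
    n = len(targets)
    return math.exp(nll / n), hits / n
```

A model that assigns uniform probability over a vocabulary of size V has perplexity exactly V, which is why perplexity is often read as an "effective branching factor": the IMR model's 11.86 means it is, on average, choosing among roughly 12 equally likely next words, versus about 40 for Yelp.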
Table 6: Comparison of language models across all datasets (best-performing metrics boldfaced in the original)

Dataset | Perplexity | Categorical Accuracy
IMR reviews | 11.86 | 0.53
Legal cases | 18.17 | 0.43
DS Jobs | 22.14 | 0.41
Drug reviews | 25.06 | 0.36
Recipes | 29.56 | 0.39
IMDB | 36.96 | 0.39
Yelp | 40.3 | 0.29

Figure 6: Comparison of language-model perplexity and categorical accuracy across all the datasets

3.5 Models for auxiliary corpora

We also estimated topic and language models for the 4 auxiliary corpora (drug reviews, DS jobs, legal cases and cooking recipes). The associations between coherence scores and number of topics for these 4 corpora were similar to the ones plotted in Figure 4 above for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best topic models had at least 14 topics, often more, with coherence scores above 0.5. The quality of the topics was also high, with intuitively coherent and contentful topics (just like IMDB / Yelp).

The perplexity and accuracy of the ULMFiT language models on previously-withheld test data are provided in Table 6, which contains the results for all 7 datasets under consideration in this paper. We see that the predictability of the IMR corpus, as reflected in its perplexity and categorical accuracy scores, is still clearly higher than that of the 4 auxiliary corpora. The perplexity of the legal-case corpus (18.17) is somewhat close to the IMR perplexity (11.86), but we should remember that the legal-case corpus is about 5 times larger than the IMR corpus. Furthermore, the legal-case categorical accuracy of 43% is still substantially lower than the IMR accuracy of 53%. Notably, even the recipe corpus, which is about 20 times larger than the IMR corpus (≈ 117.5 vs. ≈ 5.5 million words), does not have test-set scores similar to the IMR scores. The results for these 4 auxiliary corpora indicate that the IMR corpus is an outlier, with very highly templatic and generic texts.

4 DISCUSSION

The models discussed in the previous section show that language-model learning is significantly easier for IMRs than for the other 6 corpora. As can be seen in Table 6, the perplexity of the language model for IMR reviews is clearly lower than even that for legal cases, for which we expect highly templatic language and high similarity between texts. This pattern can be clearly observed in Figure 6, with the IMR corpus clearly at the very end of the high-to-low predictability spectrum.

One would not expect such highly predictable texts in an ideal scenario, where each medical review is thorough, and each decision is accompanied by strong medical reasoning relying on the specifics of the case at hand, and based on an objective physician's, or team of physicians', opinion as to what is in the patient's best interest. Arguably, these medically complex cases are as diverse as Hollywood blockbusters or fashionable restaurants – the patients themselves certainly experience them as unique and meaningful – and their reviews should be similarly diverse, or at most as templatic as a job posting or a cooking recipe. We wouldn't expect these medical reviews to be so much more predictable and generic than less socially consequential reviews of movies and restaurants.

What are the ethical and potentially legal consequences of these findings? First, while state legislators assume we have strong health-insurance related consumer protections in place – an image DMHC goes to great lengths to promote – we find the reviews to be upholding insurance-plan denials at rates that exceed what one might expect, given that the treatments in question are frequently ordered by a treating physician, and that the IMR process is the last stage in a bureaucratically laborious (hence high-attrition) process of appealing health-plan denials.

Second, given that the IMR process creates an implied relation of care between the reviewers hired by MAXIMUS and the patient – since reviewers are, after all, being entrusted with the best interests of the patient without regard to cost – one can hardly say that they are fulfilling their obligations as doctors to their patient with such seemingly rote, perfunctory reviews.

Third, if IMR processes were designed to make sure that (i) treatment decisions are made by doctors, not by profit-driven businesses, and (ii) insurance companies cannot welch on their responsibilities to plan members, one must wonder whether prescribing physicians are wrong more than half the time. Do American doctors really order so many erroneous, medically unnecessary treatments and medications? If so, how is it possible that they are so committed to and confident in them that they are willing to escalate the appeal process all the way to the state-managed IMR stage? Or is it that IMRs often serve as a final rubber stamp for health-insurance plan denials, failing their stated mission of protecting a vulnerable population?

We end this discussion section by briefly reflecting on the way we used ML/NLP methods for social-good problems in this paper. Overwhelmingly, the social-good applications of these methods and models seem to be predictive in nature: their goal is to improve the outcomes of a decision-making process, and the improvement is evaluated according to various performance-related metrics. An important class of metrics currently being developed has to do with ethical, or 'safe,' uses of ML/AI models. In contrast, our use of ML models in this paper was analytical, with the goal of extracting insights from large datasets that enable
us to empirically evaluate how well an established decision-making process with high social impact functions. Data analysis of this kind, more akin to hypothesis testing than to predictive modeling, is in fact one of the original uses of statistical models / methods. Unfortunately, using ML models in this way does not straightforwardly lead to plots showing how ML models obviously improve metrics like the efficiency or cost of a process. We think, however, that there are as many socially beneficial opportunities for this kind of data-analysis use of ML modeling as there are for its predictive uses. The main difference between them seems to be that the data-analysis uses do not lead to more-or-less immediately measurable products. Instead, they are meant to become part of a larger argument and evaluation of a socially and politically relevant issue, e.g., the ethical status of the current health-insurance related practices and consumer protections discussed here. What counts as 'success' when ML models are deployed in this way is less immediate, but could provide at least as much social good in the long run.

5 CONCLUSION AND FUTURE WORK

We examined a database of 26,361 IMRs handled by the California DMHC through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance. We found that, in a majority of cases, IMRs uphold the health-insurance denial, despite DMHC's claim to the contrary. In addition, we analyzed the text of the reviews and compared it with a sample of 50,000 Yelp reviews and the IMDB movie-review corpus. Despite the fact that these corpora are basically twice as large, we can construct a very good language model for the IMR corpus, as measured by the quality of text generation, as well as by its low perplexity and high categorical accuracy on unseen test data. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail than IMR reviews, which seem highly templatic and perfunctory in comparison. We see similar trends in topic models and in classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. These results were further confirmed by topic and language models for four other specialized-register corpora (drug reviews, data science job postings, legal-case reports and cooking recipes).

Directions for future work include, but are not limited to, (i) adding ways for patients to check that all the relevant documentation has been collected and will be reviewed, and (ii) identifying ways to hold the anonymous reviewers to higher standards of doctor-patient care. We are in the process of extending our datasets with (i) workers'

ACKNOWLEDGMENTS

We are grateful to four KDD-KiML anonymous reviewers for their comments on an earlier version of this paper. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of two Titan V GPUs used for this research, as well as the UCSC Office of Research and The Humanities Institute for a matching grant to purchase additional hardware. The usual disclaimers apply.

REFERENCES

[1] Leatrice Berman-Sandler. 2004. Independent Medical Review: Expanding Legal Remedies to Achieve Managed Care Accountability. Annals of Health Law 13 (2004).
[2] Kenneth H. Chuang, Wade M. Aubry, and R. Adams Dudley. 2004. Independent Medical Review of Health Plan Coverage Denials: Early Trends. Health Affairs 23, 6 (2004), 163–169. https://doi.org/10.1377/hlthaff.23.6.163
[3] Angus Deaton and Nancy Cartwright. 2018. Understanding and misunderstanding randomized controlled trials. Social Science and Medicine 210 (2018), 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005
[4] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1615–1625. https://doi.org/10.18653/v1/D17-1169
[5] Filippo Galgani and Achim Hoffmann. 2011. LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence, Jiuyong Li (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 445–454.
[6] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In DH '18. Association for Computing Machinery, New York, NY, USA, 121–125. https://doi.org/10.1145/3194658.3194677
[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[8] Jeremy Howard and Sebastian Ruder. 2018. Fine-tuned Language Models for Text Classification. CoRR abs/1801.06146 (2018). http://arxiv.org/abs/1801.06146
[9] Shanshan Lu. 2018. Data Scientist Job Market in the U.S. https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us More info available here: https://github.com/Silvialss/projects/tree/master/IndeedWebScraping
[10] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In HLT '11. Association for Computational Linguistics, Stroudsburg, PA, USA, 142–150.
[11] Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. (2019).
comp cases from California and (ii) private insurance cases from [12] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing other states. This will enable us to investigate if the reviews for and Optimizing LSTM Language Models. CoRR abs/1708.02182 (2017). [13] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. workers’ comp cases are substantially different from the DMHC Pointer Sentinel Mixture Models. CoRR abs/1609.07843 (2017). IMR data (the percentage of upheld decisions is much higher for [14] Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures (WSDM ’15). ACM, New York, NY, USA, workers’ comp: ≈ 90%), as well as if the reviews vary substantially 399–408. https://doi.org/10.1145/2684822.2685324 across states. [15] Shirley Eiko Sanematsu. 2001. Taking a broader view of treatment disputes Another direction for future work is to follow up on our pre- beyond managed care: Are recent legislative efforts the cure? UCLA Law Review 48 (2001). liminary qualitative research with a survey of patients that have [16] Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In experienced the IMR process to see if these patients agree with the Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE. DMHC-promoted message that the IMR process provides strong 464–472. [17] Mark Steyvers and Tom Griffiths. 2007. Probabilistic Topic Models. Lawrence consumer protection against unjustified health-plan denials. This Erlbaum Associates. could also enable us to verify if the medical documentation col- [18] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transfer- able are features in deep neural networks?. In Advances in Neural Information lected during the IMR process is complete and actually taken into Processing Systems. 3320–3328. account when the decision is made. [19] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 
2015. Character-level Con- The ultimate upshot of this project would be a list of recommen- volutional Networks for Text Classification. CoRR abs/1509.01626 (2015). arXiv:1509.01626 http://arxiv.org/abs/1509.01626 dations for the improvement of the IMR process, including but not Knowledge Intensive Learning of Generative Adversarial Networks Devendra Singh Dhami Mayukh Das Sriraam Natarajan devendra.dhami@utdallas.edu Samsung Research India The University of Texas at Dallas The University of Texas at Dallas mayukh.das@samsung.com sriraam.natarajan@utdallas.edu ABSTRACT We aim to address the above limitations. Inspired by Mitchell’s While Generative Adversarial Networks (GANs) have accelerated argument of “The Need for Biases in Learning Generalizations” [38], the use of generative modelling within the machine learning com- we mitigate the challenges of existing data hungry methods via in- munity, most of the applications of GANs are restricted to images. ductive bias while learning GANs. We show that effective inductive The use of GANs to generate clinical data has been rare due to the bias can be provided by humans in the form of domain knowl- inability of GANs to faithfully capture the intrinsic relationships edge [14, 27, 41, 50]. Rich human advice can effectively balance between features. We hypothesize and verify that this challenge can the impact of quality (sparsity) of training data. Data quality also be mitigated by incorporating domain knowledge in the generative contributes to, the well studied, modal instability of GANs. This process. Specifically, we propose human-allied GANs that using problem is especially critical in domains such as medical/clinical correlation advice from humans to create synthetic clinical data. Our analytics that does not typically exhibit ‘spatial homophily’ [21], un- empirical evaluation demonstrates the superiority of our approach like images, and are prone to distributional diversity among feature over other GAN models. 
clusters as well. Our human-guided framework proposes a robust strategy to address this challenge. Note that in our setting the human CCS CONCEPTS is an ally and not an adversary. The second limitation of access is crucial for medical data gener- • Deep Learning → Generative Adversarial Networks; • Ap- ation. Access to existing medical databases [10, 18] is hard due to plication → Healthcare; • Learning → Knowledge Intensive Learn- cost and access concerns and thus synthetic data generation holds ing. tremendous promise [6, 13, 19, 35, 48]. While previous methods KEYWORDS generated synthetic images, we go beyond images and generate clin- generative adversarial networks, human in the loop, healthcare ical data. Building on this body of work, we present a synthetic data ACM Reference Format: generation framework that effectively exploits domain expertise to Devendra Singh Dhami, Mayukh Das, and Sriraam Natarajan. 2020. Knowl- handle data quality. edge Intensive Learning of Generative Adversarial Networks. In Proceedings We make a few key contributions: of KDD Workshop on Knowledge-infused Mining and Learning (KiML’20). , 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn (1) We demonstrate how effective human advice can be provided to a GAN as an inductive bias. 1 INTRODUCTION (2) We present a method for generating data given this advice. (3) Finally, we demonstrate the effectiveness and efficacy of our Deep learning models have reshaped the machine learning landscape approach on 2 de-identified clinical data sets. Our method over the past decade [16, 29]. Specifically, Generative Adversar- is generalizable to multiple modalities of data and is not ial Networks (GANs) [17] have found tremendous success in gen- necessarily restricted to images. 
erating examples for images [34, 37, 45], photographs of human (4) Yet another feature of this approach is that training occurs faces [1, 25, 52], image to image translation [30, 33, 55] and 3D from very few data samples (< 50 in one domain) thus pro- object generation [44, 51, 53] to name a few. Despite such success, viding human guidance as a data generation alternative. there are several key factors that limit the widespread adoption of GANs, for a broader range of tasks, including, widely acknowledged data hungry nature of such methods, potential access issues of real 2 RELATED WORK medical data and finally, their restricted usage, mainly in the con- The key principle behind GANs [17] is a zero-sum game [26] from text of images. These factors have limited the use of these arguably game theory, a mathematical representation where each participant’s successful techniques in medical (or similar) domains. However, gain or loss is exactly balanced by the losses or gains of the other recently, synthetic data generation has become a centerpiece of re- participants and is generally solved by a minimax algorithm. The search in medical AI due to the diverse difficulties in collection, generator distribution 𝑝𝑑𝑎𝑡𝑎 (𝒙) over the given data 𝒙 is learned by persistence, sharing and analysis of real clinical data. sampling 𝒛 from a random distribution 𝑝 𝒛 (𝒛) (initially uniform was proposed but Gaussians have been proven superior [2]). While GANs In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, have proven to be a powerful framework for estimating generative California, USA, August 24, 2020. Use permitted under Creative Commons License distributions, convergence dynamics of naive mini-max algorithm Attribution 4.0 International (CC BY 4.0). has been shown to be unstable. 
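The zero-sum setup just described can be made concrete with a small numerical sketch. The following is our own toy illustration (not the authors' code, and not the HA-GAN model introduced later): a one-parameter "generator" g(z) = θ + z tries to match real data drawn from N(4, 1), while a logistic "discriminator" d(x) = σ(wx + b) ascends the value function V(D, G) = E[log d(x)] + E[log(1 − d(g(z)))] that the generator descends. All names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def value(w, b, theta, x_real, z):
    """V(D, G) = E[log d(x)] + E[log(1 - d(g(z)))]; D ascends, G descends."""
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * (theta + z) + b)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def d_step(w, b, theta, x_real, z, lr=0.01):
    """One gradient-ascent step on V for the discriminator parameters."""
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * (theta + z) + b)
    grad_w = np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * (theta + z))
    grad_b = np.mean(1.0 - d_real) - np.mean(d_fake)
    return w + lr * grad_w, b + lr * grad_b

def g_step(w, b, theta, z, lr=0.01):
    """One gradient-descent step on V for the generator parameter theta."""
    d_fake = sigmoid(w * (theta + z) + b)
    grad_theta = -w * np.mean(d_fake)  # dV/dtheta
    return theta - lr * grad_theta

# Alternating minimax updates on fixed minibatches.
w, b, theta = 0.0, 0.0, 0.0
x_real = rng.normal(4.0, 1.0, size=256)   # real data ~ N(4, 1)
z = rng.normal(0.0, 1.0, size=256)        # generator noise

v0 = value(w, b, theta, x_real, z)
w, b = d_step(w, b, theta, x_real, z)     # D's step pushes V up
v1 = value(w, b, theta, x_real, z)
theta = g_step(w, b, theta, z)            # G's step pushes V back down
```

Even in this toy setting, the alternating updates display the sensitivity the section discusses: whether the minimax dynamics converge depends heavily on learning rates and on the data, which is what motivates both the distribution-level remedies cited next and, in this paper, the injection of human advice.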
Some recent approaches, among many others, augment learning either via statistical relationships between the true and learned generative distributions, such as the Wasserstein-1 distance [3] or MMD [32], or via spectral normalization of the parameter space of the generator [39], which keeps the generator distribution from drifting too far. Although these approaches have improved GAN learning in some cases, there is room for improvement.

Guidance via human knowledge is a provably effective way to control learning in the presence of systematic noise (which leads to instability). One typical strategy to incorporate such guidance is by providing rules over training examples and features. Some of the earliest approaches are explanation-based learning (EBL-NN, [49]) or ANNs augmented with symbolic rules (KBANN, [50]). Various widely studied techniques for leveraging domain knowledge for optimal model generalization include polyhedral constraints in the case of knowledge-based SVMs [9, 14, 28, 47], preference rules [5, 27, 41, 42] or qualitative constraints (e.g., monotonicities/synergies [54] or quantitative relationships [15]). Notably, whereas these models exhibit considerable improvement with the incorporation of human knowledge, there is only limited use of such knowledge in training GANs. Our approach resembles the qualitative constraints framework in spirit.

While widely successful in building optimally generalized models in the presence of systematic noise (or sample biases), knowledge-based approaches have mostly been explored in the context of discriminative modeling. In the generative setting, a recent work extends the principle of posterior regularization from Bayesian modeling to deep generative models in order to incorporate structured domain knowledge [22]. Traditionally, knowledge-based generative learning has been studied as part of learning probabilistic graphical models with structure/parameter priors [36]. We aim to extend the use of knowledge to the generative model setting.

3 KNOWLEDGE INTENSIVE LEARNING OF GENERATIVE ADVERSARIAL NETWORKS
A notable disadvantage of the adversarial training formulation is that training is slow and unstable, leading to mode collapse [2], where the generator starts generating data of only a single modality. This has resulted in GANs not being exploited to their full potential in generating synthetic non-image clinical data. Human advice can encourage exploration of diverse areas of the feature space and helps learn more stable models [43]. Hence, we propose a human-allied GAN architecture (HA-GAN) (Figure 1). The architecture incorporates human advice in the form of feature correlations. Such intrinsic relationships between the features are crucial in medical data sets and thus become a natural candidate as additional knowledge/advice in guided model learning for faithful data generation.

Figure 1: Human-Allied GAN. Correlation advice takes the generated distribution closer to the real distribution.

Our approach builds upon a GAN architecture [17] where a random noise vector is provided to the generator, which tries to generate examples as close to the real distribution as possible. The discriminator tries to distinguish between real examples and ones generated by the generator. The generator tries to maximize the probability that the discriminator makes a mistake, and the discriminator tries to minimize its mistakes, resulting in a min-max optimization problem which can be solved by a minimax algorithm. We adopt the Wasserstein GAN (WGAN) architecture [3, 20] (we use 'GAN' to indicate 'W-GAN' in this case), which focuses on defining a distance/divergence (the Wasserstein or earth mover's distance) to measure the closeness between the real distribution and the model distribution.

3.1 Human input as inductive bias
Historically, two approaches have been studied for using guidance as bias. The first is to provide advice on the labels as constraints or preferences that control the search space. Some example advice rules on the labels include: (3 ≤ feature1 ≤ 5) ⇒ label = 1 and (0.6 ≤ feature2 ≤ 0.8) ∧ (4 ≤ feature3 ≤ 5) ⇒ label = 0. Such advice is more relevant in a discriminative setting but is not ideal for GANs: since GANs are shown to be sensitive to the training data, and here the labels are themselves being generated, they should not be altered during training. The second is via correlations between features as preferences (our approach), which allows for a faithful representation of diverse modality.

Advice injection: After every fixed number of iterations N, we calculate the correlation matrix of the generated data G1 and provide a set of advice ψ on the correlations between different features. Consider the following motivating example for the use of correlations as a form of advice.

Example: Consider predicting heart attack with 3 features: cholesterol, blood pressure (BP) and income. The values of the given features can vary (sometimes widely) between different patients due to several latent factors (e.g., smoking habits). It is difficult to assume any specific distribution. In other words, it is difficult to deduce whether the values for the features come from the same distribution (even though the feature values in the data set are similar).

We modify the correlation coefficients (for both positive and negative correlations) between the features, increasing them if the human advice suggests that two features are highly correlated and decreasing them if the advice suggests otherwise.

Example: Continuing the above example, since a rise in the cholesterol level can lead to a rise in BP and vice versa, expert advice here can suggest that cholesterol and BP should be highly correlated. Also, as income may not contribute directly to BP and cholesterol levels, another piece of advice can be to de-correlate cholesterol/BP and income level.

The example advice rules ∈ ψ are: 1. Correlation("cholesterol level", "BP") ↑, 2. Correlation("cholesterol level", "income level") ↓ and 3. Correlation("BP", "income level") ↓, where ↑ and ↓ indicate increase and decrease respectively. Based on the 1st advice we need to increase the correlation coefficient between cholesterol level and BP. Then

    C = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]],   A = [[1, λ, 1], [λ, 1, 1], [1, 1, 1]]    (1)

Here C is the correlation matrix, A is the advice matrix, and λ is the factor by which the correlation value is to be augmented. In the case where we need to increase the value of the correlation coefficient, λ should be > 1. We keep λ = 1/max(|C|), the maximum being taken over the off-diagonal entries (here λ = 1/0.3). Since −1.0 ≤ c ≤ 1.0 for all c ∈ C, the value of λ ≥ 1.0, leading to enhanced correlation via the Hadamard product. Thus the new correlation matrix Ĉ is

    Ĉ = C ⊙ A = C ⊙ [[1, 1/0.3, 1], [1/0.3, 1, 1], [1, 1, 1]] = [[1, 0.667, 0.3], [0.667, 1, 0.07], [0.3, 0.07, 1]]    (2)

If the advice says that features have low correlations (the 2nd and 3rd rules in the example), we decrease the correlation coefficient. Now λ must be < 1 and we set λ = max(|C|). Since −1 ≤ c ≤ 1.0, the value of λ ≤ 1.0, and multiplying by λ will decrease the correlation value; the new correlation matrix is

    Ĉ1 = Ĉ ⊙ A = [[1, 0.667, 0.3], [0.667, 1, 0.07], [0.3, 0.07, 1]] ⊙ [[1, 1, 0.3], [1, 1, 0.3], [0.3, 0.3, 1]] = [[1, 0.667, 0.09], [0.667, 1, 0.021], [0.09, 0.021, 1]]    (3)

This is used to create the newly generated data G̃1. For negative correlations, the process is unchanged.

3.2 Advice-guided data generation
After Ĉ1 is constructed, we next generate data satisfying the constraints. To this effect, we employ the Iman-Conover method [23], a distribution-free method to define dependencies between distributional variables based on rank correlations such as Spearman or Kendall tau correlations. Since we deal with linear relationships between the features and assume a normal distribution, and since the Pearson coefficient has been shown to perform equally well with the Iman-Conover method [40] due to the close relationship between Pearson and Spearman correlations, we use Pearson correlations. Further, we assume that the features are Gaussian, justified by the fact that most lab test data is continuous. The Iman-Conover method consists of the following steps:

[Step 1]: Create a random standardized matrix M with values x ∈ M drawn from a Gaussian distribution. This is obtained by the process of inverse transform sampling described next. Let V be a uniformly distributed random variable and CDF be the cumulative distribution function. For a sampled point v, CDF(v) = P(V ≤ v). Thus, to generate samples, the values v ∼ V are passed through CDF⁻¹ to obtain the desired values x [CDF⁻¹(v) = {x | CDF(x) ≤ v, v ∈ [0, 1]}]. Thus, for a Gaussian,

    CDF(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−x²/2) dx = ∫_{0}^{x} (1/√(2π)) exp(−x²/2) dx = [−exp(−x²/2)]₀ˣ    (4)

The inverse CDF can thus be written as CDF⁻¹(v): 1 − exp(−x²/2) ≤ v, and the desired values x ∈ M can be obtained as x = √(−2 ln(1 − v)).

[Step 2]: Calculate the correlation matrix E of M.

[Step 3]: Calculate the Cholesky decomposition F of the correlation matrix E. The Cholesky decomposition [46] of a positive-definite matrix is given as the product of a lower triangular matrix and its conjugate transpose. Note that for the Cholesky decomposition to be unique, the target matrix should be positive definite (such as the covariance matrix), whereas the correlation matrix used in our algorithm is only positive semi-definite. We enforce positive-definiteness by repeatedly adding very small values to the diagonal of the correlation matrix until positive-definiteness is ensured. Given a symmetric and positive definite matrix E, its Cholesky decomposition F is such that E = F · Fᵀ.

[Step 4]: Calculate the Cholesky decomposition Q of the correlation matrix obtained after the modifications based on human advice, Ĉ. As above, the Cholesky decomposition is such that Ĉ = Q · Qᵀ.

[Step 5]: Calculate the reference matrix T by transforming the sampled matrix M from Step 1 to have the desired correlations of Ĉ, using their Cholesky decompositions.

[Step 6]: Rearrange the values in the columns of the generated data G1 to have the same ordering as the corresponding columns of the reference matrix T, to obtain the final generated data G̃1.

Cholesky decomposition to model correlations: Given a randomly generated data set with no correlations P, a correlation matrix C and its Cholesky decomposition Q, data that faithfully follows the given correlations in C can be generated as the product of the obtained lower triangular matrix with the original uncorrelated data, i.e., P̂ = QP. The correlation of the newly obtained data P̂ is

    Corr(P̂) = Cov(P̂) / (σ_P̂ σ_P̂) = (E[P̂P̂ᵀ] − E[P̂] E[P̂]ᵀ) / (σ_P̂ σ_P̂)    (5)

Since we consider data P̂ from a Gaussian distribution with zero mean and unit variance,

    Corr(P̂) = E[P̂P̂ᵀ] = E[(QP)(QP)ᵀ] = E[QPPᵀQᵀ] = Q E[PPᵀ] Qᵀ = QQᵀ = C    (6)

Thus the Cholesky decomposition can capture the desired correlations faithfully and can be used for generating correlated data. Since we already have a normal sampled matrix M and a calculated correlation matrix E of M, we need to calculate a reference matrix (Step 5).

3.3 Human-Allied GAN training
Since the human expert advice is provided independently of the GAN architecture, our method is agnostic of the underlying GAN architecture. We make use of the Wasserstein GAN (WGAN) architecture since it has been shown to be more stable during training and can handle mode collapse [3]. Only the error backpropagation values differ when we are using the data generated by the underlying GAN versus the data generated by the Iman-Conover method. Our algorithm starts with the general process of training a GAN, where the generator takes random noise as input and generates data, which is then passed, along with the real data, to the discriminator. The discriminator tries to identify the real and generated data, and the error is backpropagated to the generator. After every specified number of iterations, the correlations between features C in the generated data are obtained and a new correlation matrix Ĉ is computed with respect to the expert advice (Section 3.1). A new data set is generated wrt Ĉ using the Iman-Conover method (Section 3.2) and then passed to the discriminator along with the real data set. We train the GAN for 10K epochs and provide correlation advice every 1K iterations.

4 EXPERIMENTAL EVALUATION
We aim to answer the following questions:
Q1: Does providing advice to GANs help in generating better quality data?
Q2: Are GANs with advice effective for data sets that have few examples?
Q3: How does bad advice affect the quality of generated data?
Q4: How well does human advice handle class imbalance?
Q5: How does our method compare to state-of-the-art GAN architectures?

We consider 2 real clinical data sets.
(1) Nephrotic Syndrome is a novel data set of symptoms that indicate kidney damage. It consists of 50 kidney biopsy images along with the clinical reports, sourced from Dr Lal PathLabs, India². We use the clinical reports, which consist of the values for kidney tissue diagnosis; these can confirm the clinical diagnosis, help to identify high-risk patients, influence treatment decisions and help medical practitioners to plan and prognosticate treatments. The data consists of 19 features with 44 positive and 6 negative examples.
(2) The MIMIC database [24] consists of de-identified information on patients admitted to critical care units at a large tertiary care hospital. The features included are predominantly time-window aggregations of physiological measurements from the medical records. We selected relevant lab results, vital sign observations and feature aggregations. The data consists of 18 features with 5813 positive and 40707 negative examples.

² https://www.lalpathlabs.com/

Advice Acquisition: Here we compile the sources from which we obtain the advice.
(1) Nephrotic Syndrome: This is a novel real data set, and the advice is obtained from a nephrologist in India. According to the problem statement from the expert, nephrotic syndrome involves the loss of a lot of protein, and nephritic syndrome involves the loss of a lot of blood through urine. A kidney biopsy is often required to diagnose the underlying pathology in patients with suspected glomerular disease. The goal of the project is to build a clinical support system that predicts the disease using clinical features, thus reducing the need for kidney biopsy. Since the data collection is scarce, a synthetic data set can help in better understanding of the disease from the clinical features.
(2) MIMIC: The feature set and the expected correlations are obtained in consultation with trauma experts at a Dallas hospital.

All experiments were run on a 64-bit Intel(R) Xeon(R) CPU E5-2630 v3 server for 10K epochs. Both the generator and discriminator are neural networks with 4 hidden layers. To measure the quality of the generated data we make use of the train-on-synthetic, test-on-real (TSTR) method proposed in [12]. We use gradient boosting with 100 estimators and a learning rate of 0.01 as the underlying model.

Table 1 shows the results of the TSTR method with data generated with advice (HA-GAN_GA) and without advice (GAN). It shows that the data generated with advice has higher TSTR performance than the data generated without advice across all data sets and all metrics. Thus, to answer Q1: providing advice to generative adversarial networks captures the relationships between features better and thus enables the generation of better quality synthetic data.

Table 1: TSTR results (≈ 3 dec.). N/A in Nephrotic Syndrome denotes that all generated labels were of a single class (0 in our case) and thus we were not able to run the discriminative algorithm in the TSTR method. GA and BA denote good and bad advice to our HA-GAN model respectively.

Data set | Method     | Recall | F1    | AUC-ROC | AUC-PR
NS       | GAN        | 0.584  | 0.666 | 0.509   | 0.911
NS       | HA-GAN_BA  | 0.42   | 0.511 | 0.518   | 0.886
NS       | medGAN     | N/A    | N/A   | N/A     | N/A
NS       | medWGAN    | N/A    | N/A   | N/A     | N/A
NS       | medBGAN    | N/A    | N/A   | N/A     | N/A
NS       | HA-GAN_GA  | 1.0    | 0.943 | 0.566   | 0.947
MIMIC    | GAN        | 0.122  | 0.119 | 0.495   | 0.174
MIMIC    | HA-GAN_BA  | 0.285  | 0.143 | 0.459   | 0.235
MIMIC    | medGAN     | 0.374  | 0.163 | 0.478   | 0.279
MIMIC    | medWGAN    | 0.0    | 0.0   | 0.5     | 0.562
MIMIC    | medBGAN    | 0.0    | 0.0   | 0.5     | 0.562
MIMIC    | HA-GAN_GA  | 0.979  | 0.263 | 0.598   | 0.567

Learning with less data: GANs with advice are especially impressive on the nephrotic syndrome data, which consists of only 50 examples and is thus very small when compared to the number of samples typically required to train a GAN model; our method is better across all metrics. Thus, we realize an important property of incorporating human guidance in the GAN model and can answer Q2 affirmatively. The use of advice opens up the potential of using GANs in the presence of sparse data samples.

Effect of bad advice: Table 1 also shows the results for data generated with bad advice (HA-GAN_BA). To simulate bad advice, we follow a simple process: if the advice says that the correlation between features should be high, we set the corresponding correlations in Ĉ to 0, and if the advice says that the correlation should be low, we set the correlations in Ĉ to either 1 or −1, based on whether the original correlation is positive or negative. Thus, given a correlation matrix

    C = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]]    (7)

suppose the advice says that we need to increase the correlation coefficient between feature 1 and feature 2. Then the new correlation matrix after bad advice can be calculated as:

    C = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]],   A = [[1, λ, 1], [λ, 1, 1], [1, 1, 1]]    (8)

    Ĉ = C ⊙ A    (9)

where λ is the factor by which the correlation value is to be augmented. Since the advice asks to increase the correlation, we set λ = 0. Thus,

    Ĉ = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]] ⊙ [[1, 0, 1], [0, 1, 1], [1, 1, 1]] = [[1, 0.0, 0.3], [0.0, 1, 0.07], [0.3, 0.07, 1]]    (10)

Similarly, if the advice says that we need to decrease the correlation coefficient between feature 1 and feature 3, we set λ = 1/(feat. val), the reciprocal of the current correlation value (here 1/0.3):

    Ĉ = [[1, 0.2, 0.3], [0.2, 1, 0.07], [0.3, 0.07, 1]] ⊙ [[1, 1, 1/0.3], [1, 1, 1], [1/0.3, 1, 1]] = [[1, 0.2, 1.0], [0.2, 1, 0.07], [1.0, 0.07, 1]]    (11)

As the results in Table 1 show, giving bad advice adversely affects the performance, thereby answering Q3.

The nephrotic syndrome and MIMIC data sets are relatively unbalanced, with pos-to-neg ratios of ≈ 8:1 and 1:7 respectively. Most medical data sets, except highly curated ones, are unbalanced, and a data generator model should be able to handle this imbalance. Since our method explicitly focuses on the correlations between features and generates better quality data based on such relationships, it is quite robust to imbalance in the underlying data. This can be seen in the results in Table 1, where advice-based data generation outperforms the non-advice and bad-advice based data generation. Thus, we can answer Q4 affirmatively.

To answer Q5, we compare our method to 3 GAN architectures: medGAN [8], which uses an encoder-decoder framework for EHR data generation, and its 2 variants medBGAN and medWGAN [4]; the results are shown in Table 1. Our method, with good advice, outperforms the baselines in both domains, showing the effectiveness of our method.

5 CONCLUSION
We presented a new GAN formulation that employs correlation information between features as advice to generate new correlated data and train the underlying GAN model. We tested our model on real clinical data sets and showed that incorporating advice helps generate good quality synthetic medical data. We employed the TSTR method to test the quality of the generated data and demonstrated that the data generated with advice is more aligned with the real data.

There are several interesting future directions. First, providing advice only when required, in an active fashion, can allow for a significant reduction in the amount of effort on the human side. Second, there can be multiple advice options, such as posterior regularization [15], that can be used to capture feature relationships explicitly. Third, although we do not have identifiers in the data, thereby eliminating the need for differential privacy [11], a general framework that can uphold the privacy of patient data along the lines of using Cholesky decomposition [7, 31] is a natural next step.

ACKNOWLEDGMENTS
DSD and SN gratefully acknowledge DARPA Minerva award FA9550-19-1-0391. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

REFERENCES
[1] Grigory Antipov, Moez Baccouche, and Jean-Luc Dugelay. 2017. Face aging with conditional generative adversarial networks. In ICIP.
[2] Martin Arjovsky and Leon Bottou. 2017. Towards principled methods for training generative adversarial networks. In ICLR.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. ICML (2017).
[4] Mrinal Kanti Baowaly, Chia-Ching Lin, Chao-Lin Liu, and Kuan-Ta Chen. 2019. Synthesizing electronic health records using improved generative adversarial networks. JAMIA (2019).
[5] Darius Braziunas and Craig Boutilier. 2006. Preference elicitation and generalized additive utility. In AAAI.
[6] Anna L Buczak, Steven Babin, and Linda Moniz. 2010. Data-driven approach for creating synthetic electronic medical records. BMC Medical Informatics and Decision Making (2010).
[7] Jim Burridge. 2003. Information preserving statistical obfuscation. Statistics and Computing (2003).
[8] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F Stewart, and Jimeng Sun. 2017. Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. In MLHC.
[9] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning (1995).
[10] Ivo D Dinov. 2016. Volume and value of big healthcare data. Journal of Medical Statistics and Informatics (2016).
[11] Cynthia Dwork. 2008. Differential privacy: A survey of results. In TAMC.
[12] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. 2017. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633 (2017).
[13] Maayan Frid-Adar, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. 2018. Synthetic data augmentation using GAN for improved liver lesion classification. In ISBI.
[14] Glenn M Fung, Olvi L Mangasarian, and Jude W Shavlik. 2003. Knowledge-based support vector machine classifiers. In NIPS.
[15] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. 2010. Posterior regularization for structured latent variable models. JMLR (2010).
[36] V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, and T. L. Griffiths. 2006. Structured Priors for Structure Learning. In UAI.
[37] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In ICCV.
[38] Tom M Mitchell. 1980. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey.
[39] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normalization for generative adversarial networks. ICLR (2018).
[40] Klemen Naveršnik and Klemen Rojnik. 2012. Handling input correlations in pharmacoeconomic models. Value in Health (2012).
[41] P. Odom, T. Khot, R. Porter, and S. Natarajan. 2015. Knowledge-Based Probabilistic Logic Learning. In AAAI.
[42] Phillip Odom and Sriraam Natarajan. 2015. Active advice seeking for inverse reinforcement learning. In AAAI.
[43] Phillip Odom and Sriraam Natarajan. 2018. Human-guided learning for probabilistic logic models. Frontiers in Robotics and AI (2018).
[44] Michela Paganini, Luke de Oliveira, and Benjamin Nachman. 2018. CaloGAN: Simulating 3D high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Physical Review D (2018).
[45] Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR (2016).
[46] Ernest M Scheuer and David S Stoller. 1962. On the generation of normal random vectors. Technometrics (1962).
[47] Bernhard Schölkopf, Patrice Simard, Alex J Smola, and Vladimir Vapnik. 1998. Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems. 640–646.
[48] Rittika Shamsuddin, Barbara M Maweu, Ming Li, and Balakrishnan Prabhakaran. 2018.
Virtual patient model: an approach for generating synthetic healthcare time series data. In ICHI.
[49] Jude W Shavlik and Geoffrey G Towell. 1989. Combining explanation-based learning and artificial neural networks. In Proceedings of the Sixth International Workshop on Machine Learning. Elsevier.
[50] Geoffrey G Towell and Jude W Shavlik. 1994. Knowledge-based artificial neural networks. Artificial Intelligence (1994).
[51] Yan Wang, Biting Yu, Lei Wang, Chen Zu, David S Lalush, Weili Lin, Xi Wu, Jiliu Zhou, Dinggang Shen, and Luping Zhou. 2018. 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. NeuroImage (2018).
[52] Zongwei Wang, Xu Tang, Weixin Luo, and Shenghua Gao. 2018. Face aging with identity-preserved conditional generative adversarial networks. In CVPR.
[53] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS.
[54] S. Yang and S. Natarajan. 2013. Knowledge Intensive Learning: Combining Qualitative Constraints with Causal Independence for Parameter Learning in Probabilistic Models. In ECMLPKDD.
[55] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
[18] Peter Groves, Basel Kayyali, David Knott, and Steve Van Kuiken. 2016. The 'big data' revolution in healthcare: Accelerating value and innovation. (2016).
[19] John T Guibas, Tejpal S Virdi, and Peter S Li. 2017. Synthetic medical images from dual generative adversarial networks. arXiv preprint arXiv:1709.01872 (2017).
[20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of Wasserstein GANs. In NIPS.
[21] Haroun Habeeb, Ankit Anand, Mausam Mausam, and Parag Singla. 2017. Coarse-to-fine lifted MAP inference in computer vision. In IJCAI.
[22] Zhiting Hu, Zichao Yang, Russ R Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye Dong, and Eric P Xing. 2018. Deep Generative Models with Learnable Knowledge Constraints. In NeurIPS.
[23] Ronald L Iman and William-Jay Conover. 1982. A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics - Simulation and Computation (1982).
[24] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data (2016).
[25] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR.
[26] Harold William Kuhn and Albert William Tucker. 1953. Contributions to the Theory of Games.
[27] Gautam Kunapuli, Phillip Odom, Jude W Shavlik, and Sriraam Natarajan. 2013. Guiding autonomous agents to better behaviors through human advice. In ICDM.
[28] Quoc V Le, Alex J Smola, and Thomas Gärtner. 2006. Simpler knowledge-based support vector machines. In ICML.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature (2015).
[30] Minjun Li, Haozhi Huang, Lin Ma, Wei Liu, Tong Zhang, and Yugang Jiang. 2018. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. In ECCV.
[31] Yaping Li, Minghua Chen, Qiwei Li, and Wei Zhang. 2011. Enabling multilevel trust in privacy preserving data mining. TKDE (2011).
[32] Yujia Li, Kevin Swersky, and Rich Zemel. 2015. Generative moment matching networks. In ICML.
[33] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017.
Unsupervised image-to-image translation networks. In NIPS.
[34] Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In NIPS.
[35] Faisal Mahmood, Richard Chen, and Nicholas J Durr. 2018. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging (2018).

Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak

Amanuel Alambo (Knoesis Center, Dayton, Ohio; amanuel@knoesis.org)
Manas Gaur (AI Institute, University of South Carolina, Columbia, South Carolina; mgaur@email.sc.edu)
Krishnaprasad Thirunarayan (Knoesis Center, Dayton, Ohio; tkprasad@knoesis.org)

ABSTRACT

The COVID-19 pandemic is having a serious adverse impact on the lives of people across the world. COVID-19 has exacerbated community-wide depression and has led to increased drug abuse brought about by the isolation of individuals as a result of lockdown. Further, apart from providing informative content to the public, the incessant media coverage of the COVID-19 crisis in terms of news broadcasts, published articles, and sharing of information on social media has had an undesired snowballing effect on stress levels (further elevating depression and drug use) due to an uncertain future. In this position paper, we propose a novel framework for assessing the spatio-temporal-thematic progression of depression, drug abuse, and informativeness of the underlying news content across the different states in the United States. Our framework employs an attention-based transfer learning technique to apply knowledge learned on a social media domain to a target domain of media exposure. To extract news articles that are related to COVID-19 communications from the streaming news content on the web, we use neural semantic parsing and background knowledge bases in a sequence of steps called semantic filtering. We achieve promising preliminary results on three variations of the Bidirectional Encoder Representations from Transformers (BERT) model. We compare our findings against a report from Mental Health America, and the results show that our fine-tuned BERT models perform better than vanilla BERT. Our study can benefit epidemiologists by offering actionable insights on COVID-19 and its regional impact. Further, our solution can be integrated into end-user applications to tailor news for users based on their emotional tone, measured on the scale of depressiveness, drug abusiveness, and informativeness.

KEYWORDS

COVID-19; Spatio-Temporal-Thematic; Depressiveness; Drug Abuse; Informativeness; Transfer Learning

ACM Reference Format: Amanuel Alambo, Manas Gaur, and Krishnaprasad Thirunarayan. 2020. Depressive, Drug Abusive, or Informative: Knowledge-aware Study of News Exposure during COVID-19 Outbreak. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

The COVID-19 pandemic has changed our societal dynamics in different ways due to the varying impact of news articles and broadcasts on a diverse population in society. Thus, it is important to place the news articles in their spatio-temporal-thematic (Nagarajan et al., 2009; Andrienko et al., 2013; Harbelot et al., 2015) contexts to offer appropriate and timely response and intervention. In order to limit the scope of this research agenda, we propose to focus on identifying regions that are exposed to depressive and drug abusive news articles and to determine/recommend ways for timely interventions by epidemiologists.

The impact of COVID-19 on mental health has been investigated in recent studies (Garfin et al., 2020; Holmes et al., 2020; Qiu et al., 2020). [4] studied the impact of repeated media exposure on the mental well-being of individuals and its ripple effects. [8] underscore the importance of a multidisciplinary study to better understand COVID-19; specifically, the study explores its psychological, social, and neuroscientific impacts. [12] studied the psychological impact the COVID-19 lockdown had on the Chinese population. These studies, however, do not adequately explore a technique to computationally analyze the regional repercussions associated with media exposure to COVID-19 that may provide a better basis for local grassroots-level action.

We propose an approach to measure depressiveness, drug abusiveness, and informativeness as a result of media exposure for various states in the US in the months from January 2020 to March 2020. Our study is focused on the first quarter of 2020, as this period was critical in the spread of COVID-19 and its ominous impact; this was a period when the public faced major changes to lifestyle including lockdown, social distancing, closure of businesses, unemployment, and, broadly speaking, a complete lack of control over the unfolding situation, precipitating severe uncertainty about the impending future. In consequence, this continued media exposure progressively worsened the mental health of individuals across the board. We analyze and score news content on three orthogonal dimensions: spatial, temporal, and thematic. For spatial, we use state boundaries. For temporal, we use monthly data analysis. For thematic, we score news content on the category/dimension of depression, drug abuse, and informativeness (relevant to COVID-19 but not directly connected to either depression or drug abuse).

Figure 1: Spatio-Temporal-Thematic Dimensions

Our study hinges on the use of domain-specific language modeling and transfer learning to better understand how depressiveness, drug abusiveness, and informativeness of news articles evolve in response to media exposure by people. We conduct the transfer of knowledge learned on a social media platform to the domain of exposure to news using variations of the attention-based BERT model (Devlin et al., 2018), also called Vanilla BERT. Thus, in addition to vanilla BERT, we fine-tune BERT models on corpora that are representative of depression and drug abuse. Then, we compare results obtained using the three variants of the BERT model. For scoring depressiveness, drug abusiveness, and informativeness of news articles, we utilize entities from structured domain knowledge from the Patient Health Questionnaire (PHQ-9) lexicon (Yazdavar et al., 2017), Drug Abuse Ontology (DAO) (Cameron et al., 2013), and DBpedia (Lehmann et al., 2015). The PHQ-9 lexicon is a knowledge base developed specifically for assessing depression, and DAO is built to study drug abuse. Similarly, we use DBpedia, which is a generic and comprehensive knowledge base, for assessing the informativeness of news content.

Having determined the scores for depressiveness, drug abusiveness, and informativeness of news articles for each state during the three months, we computed the aggregate score for each thematic category by summing up the scores for the news articles. We finally assigned the category with the highest score as a label for a state. For instance, if the aggregate score of depressiveness for the state of Iowa in the month of January 2020 is the highest of the three thematic categories, then the state of Iowa is assigned a label of depression for that month, which means the state of Iowa is most exposed to depressive news content. Thus, identifying which states are consistently exposed to depressive or drug abusive news content enables policy makers and epidemiologists to devise appropriate intervention strategies.

2 DATA COLLECTION

We collected 1.2 Million news articles from the Web and GDELT¹ (a resource that stores world news on significant events from different countries) using semantic filtering (Sheth and Kapanipathi, 2016), spanning the period from January 01, 2020, to March 29, 2020. We filtered news articles that did not originate from within the US and grouped the ones that are from the US based on their state of origination. The state-level grouped news articles had a total of over 150K entities identified using the DBpedia spotlight service². However, since using a coarse filtering service such as DBpedia spotlight over the entire news articles is not efficient and brings in irrelevant entities, and thus noisy news articles, we utilize ("i") a neural parsing approach with self-attention (Wu et al., 2019) to extract relevant entities. After extracting relevant entities and news articles, we use ("ii") the DBpedia spotlight service to identify news articles that are related to online communications about COVID-19.

Figure 2: Knowledge-based entity extraction using Semantic Filtering

For this task, we explored 780 DBpedia categories that are relevant to COVID-19 communications to create the most relevant set of entities and news articles. Further, upon inspection of the news articles, we discovered medical terms that were not available in DBpedia. As a result, we used ("iii") the MeSH terms hierarchy in the Unified Medical Language System (UMLS), the Diagnostic and Statistical Manual for Mental Disorders (DSM-5) lexicon (Gaur et al., 2018), and the Drug Abuse Ontology (DAO), collectively referred to as the Mental Health and Drug Abuse Knowledgebase (MHDA-Kb), to spot additional entities. Thus, from 700K unique news articles (which were extracted from the total of 1.2 Million news articles by removing duplicates), we created a set of 120K unique entities that are described by the 780 DBpedia categories and 225 concepts in MHDA-Kb. The figures below show two examples that illustrate entities spotted during entity extraction on a sample news article. A news article that has entities identified using this sequence of steps is selected for our study.

Figure 3: Example entity extraction-I using Semantic Filtering
Figure 4: Example entity extraction-II using Semantic Filtering

¹ https://www.gdeltproject.org/
² https://www.dbpedia-spotlight.org/

3 METHODS

We propose to use three variations of the BERT model for representing news articles. In its basic form, we use vanilla BERT for encoding news articles. For the remaining two variations, we fine-tune BERT on a binary sequence classification task by independently training on two corpora using masked language modeling (MLM) and next sentence prediction (NSP) objectives. The two corpora used are: 1) Subreddit Depression (Gkotsis et al., 2017; Gaur et al., 2018); 2) a combination of subreddits: Crippling Alcoholism, Opiates, Opiates Recovery, and Addiction (abbreviated COOA), each consisting of Reddit posts about drug abuse. Subreddit Depression has 760049 posts across 121795 Redditors, and COOA has 1416765 posts from 46183 users, both consisting of posts from the years 2005 - 2016. Reddit posts belonging to the subreddits Depression or COOA are considered positive classes, and the 380444 posts from a control group (~10K subreddits unrelated to mental health) serve as negative classes. We use the following settings for training our BERT model for sequence classification: training batch size of 16, maximum sequence length of 256, Adam optimizer with learning rate of 2e-5, number of training epochs set to 10, and a warmup proportion of 0.1. We used a 40%-60% split for training and testing sets for creating the BERT models and achieved a test accuracy of 89% for Depression-BERT and 78% for Drug Abuse-BERT. We set the size of the training set smaller than the testing set for generalizability of our models. In this manuscript, we refer to the BERT model fine-tuned on subreddit Depression as Depression-BERT or DPR-BERT, while the one fine-tuned on subreddit COOA is Drug Abuse-BERT or DA-BERT.

In addition to using BERT for encoding news content, we also use it for representing the entities in the background knowledge bases (i.e., PHQ-9, DAO, and DBpedia). Once we have encoded the news articles and the entities in the knowledge bases using vanilla BERT or a fine-tuned BERT model, we generated a depressiveness score, drug abusiveness score, and informativeness score corresponding to the entities in PHQ-9, DAO, and DBpedia respectively. The equation below gives the score of a news article for a category given one of the BERT models:

$$\mathit{Score}^{m}_{c}(\text{news}) = \frac{1}{|E_{KB}|} \sum_{e=1}^{|E_{KB}|} \mathit{cossim}(\text{news}, e) \qquad (1)$$

where,
m ∈ {vanilla-BERT, DPR-BERT, DA-BERT}
c ∈ {informativeness, depressiveness, drug abuse}
cossim(news, e): cosine similarity between a news content and an entity in KB
KB - a collection of entities present in PHQ-9, DBpedia, or DAO

We used the base variant of the BERT model with 12 layers, 768 hidden units, and 12 attention heads. We use PyTorch 1.5.0+cu101 for fine-tuning our BERT models. All our programs were run on Google Colab's NVIDIA Tesla P100 PCI-E GPU.

4 PRELIMINARY RESULTS AND DISCUSSION

In this section, we report the state-wise labels (i.e., depressive, drug abusive, informative) for each month obtained after summing the scores of news articles as described. The category with the highest cumulative score is set as the label for a state.

Using vanilla-BERT (Figure 5), we can see that no state shows exposure to news content on drug abuse in January. Going from February to March, we see depressive news content move from inner-most states such as Missouri, Kansas, and Colorado to border states such as California, Montana, North Dakota, and Louisiana, making way for informative news content. Further, there are fewer states exposed to drug-related news content than those exposed to depressive or informative news content in February or March. Particularly, Arizona and Virginia show consistent exposure to drug-related news content in February and March.

Figure 5: vanilla BERT modeling of Depressiveness, Drug Abuse, and Informativeness in US states.

Using Depression-BERT, as shown in Figure 6, we see that states such as Texas and Kansas are exposed to depressive news content for the months of January and February, while states such as California, Montana, Alaska, and Michigan show higher consumption of depressive news content in February and March. With regard to informativeness, we see an overall even distribution of informative news content across the nation in February and March. Further, we see a few midwest states showing relatively higher instances of news content that is informative rather than depressive in February and March. It is interesting to see a few southern states such as Oklahoma, Texas, and Arkansas transition from exposure to depressive news content in the month of February to drug-use-related news content in the month of March.

Figure 6: Depression-BERT (DPR-BERT) modeling of Depressiveness, Drug Abuse, and Informativeness in US states

Using the Drug Abuse-BERT model (Figure 7), states such as Texas and Wisconsin shift from exposure to depressive news content in January to exposure to drug-related news content in February, while states such as California and Oklahoma transition from exposure to depressive news content in February to drug-related news content in March. Further, we see the informativeness of news content sweeping from the east to the midwest, to parts of the south, and to some parts of the west from February to March.

Figure 7: Drug Abuse BERT (DA-BERT) modeling of Depressiveness, Drug Abuse, and Informativeness in US states

Our results show that a fine-tuned BERT model cleanly separates the thematic categorical scores for a state. For instance, using DA-BERT for the month of March, the drug abuse score for the state of California is much higher than the score of depressiveness or informativeness for the same state. However, with the vanilla BERT model, the three scores computed for the various states and months are marginally different. Moreover, the results using DPR-BERT or DA-BERT capture the state-level ranking of mental disorders by Mental Health America³ better than vanilla-BERT; for a few states, the fine-tuned BERT models identify more months as having media exposure to depression or drug abuse news content.

As indicated in Table 1, we report months showing predominant media exposure to either depressive or drug abuse news articles using the three variants of the BERT model. We use 10 of the 13 states recognized as showing high prevalence of mental disorders according to a report by Mental Health America on overall mental disorder ranking. The 3 states not included in this table are Washington, Wyoming, and Idaho; we did not consider these 3 states as they were not in our dataset cohort. For the Mental Health America (MHA) report, we make a practical assumption that each of the three months is either depressive or drug abusive for each state. Thus, our objective is to maximize the number of months with exposure to depressive/drug abuse news content for each of the 10 states. We can see in Table 1 that fine-tuned BERT models help identify more months as having exposure to depressive or drug abuse news content than vanilla BERT does for the 10 states. For example, using DA-BERT, five states are identified to have at least two months showing exposure to depressive/drug abuse news content, while DPR-BERT identifies six states as having been exposed to depressive/drug abuse news content for two months. On the other hand, vanilla-BERT identifies only two states with depressive/drug abuse news content for two months.

Table 1: Evaluation of base and domain-specific BERT models for MHA states over the period of three months (January, February, and March). These three months showed high dynamicity in COVID-19 spread. Each column lists the months with depression/drug abuse exposure.

MHA states with high DPR and DA | vanilla-BERT | DA-BERT | DPR-BERT
Tennessee      | Feb, Mar | Feb, Mar      | Feb, Mar
Alabama        | Feb      | Feb, Mar      | Feb
Oklahoma       | Mar      | Feb, Mar      | Feb, Mar
Kansas         | Feb      | Jan, Feb      | Jan, Feb
Montana        | Mar      | Feb           | Feb, Mar
South Carolina | Mar      | Mar           | Feb, Mar
Alaska         | Feb, Mar | Jan, Feb, Mar | Feb, Mar
Utah           | Mar      | Mar           | Mar
Oregon         | None     | Feb           | None
Nevada         | Feb      | Feb           | None

To compare models with one another and against the report by Mental Health America (MHA), we compute a Jaccard Index between each pair of models and each model against the report from MHA. The equation below computes the Jaccard similarity between the results of two models or a model's results with an MHA report:

$$J(m_1, m_2) = \sum_{i \in S} \frac{|m_1^M \cap m_2^M|}{|m_1^M \cup m_2^M|} \qquad (2)$$

where,
m1, m2 ∈ {vanilla-BERT, DPR-BERT, DA-BERT, MHA}
S - Set of States in the US (Table 1)
m1^M, m2^M: Number of depressive, drug abusive, or informative months for a state "i"

We report inter-model and model-to-MHA Jaccard similarity scores computed using equation (2) in Figure 8.

Figure 8: Inter-BERT model and BERT Model-to-MHA Jaccard Similarity Scores as a measure of closeness of a model's prediction to an extensive survey by Mental Health America (MHA).

As shown in Figure 8, DA-BERT gives the best results against the MHA report in Jaccard similarity (0.53), which means DA-BERT identifies over half of the state-to-month instances in MHA. On the other hand, vanilla-BERT has a Jaccard similarity of 0.37 with MHA, which can be interpreted as vanilla-BERT identifying a little over one-third of the state-to-month instances in MHA. The best Jaccard similarity is achieved between DPR-BERT and vanilla-BERT (0.7); thus, 70% of state-to-month mappings are shared between DPR-BERT and vanilla-BERT based on the Jaccard index. It is interesting to see that DA-BERT has the same Jaccard similarity with vanilla-BERT and DPR-BERT, subsuming the former and being subsumed by the latter in terms of depressive/drug abusive months.

5 CONCLUSION

In this paper, we model depressiveness, drug abusiveness, and informativeness of news articles to assess the dominant category characterizing each US state during each of the three months (Jan 2020 to Mar 2020). We demonstrate the power of transfer learning by fine-tuning an attention-based deep learning model on a different domain and use the domain-tuned model for gleaning the nature of media exposure. Specifically, we use background knowledge bases for measuring depressiveness, drug abusiveness, and informativeness of news articles. We found that DA-BERT identifies the most state-to-month instances as being exposed to depressive or drug abuse news content according to the report from Mental Health America. In the future, we plan to incorporate background knowledge bases in our attention-based transfer learning framework to further investigate knowledge-infused learning (Kursuncu et al., 2019).

³ https://www.mhanational.org/issues/ranking-states

REFERENCES

[1] Gennady Andrienko, Natalia Andrienko, Harald Bosch, Thomas Ertl, Georg Fuchs, Piotr Jankowski, and Dennis Thom. 2013. Thematic patterns in georeferenced tweets through space-time visual analytics. Computing in Science & Engineering 15, 3 (2013), 72–82.
[2] Delroy Cameron, Gary A Smith, Raminta Daniulaityte, Amit P Sheth, Drashti Dave, Lu Chen, Gaurish Anand, Robert Carlson, Kera Z Watkins, and Russel Falck. 2013. PREDOSE: a semantic web platform for drug abuse epidemiology using social media. Journal of Biomedical Informatics 46, 6 (2013), 985–997.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Dana Rose Garfin, Roxane Cohen Silver, and E Alison Holman. 2020. The novel coronavirus (COVID-2019) outbreak: Amplification of public health consequences by media exposure. Health Psychology (2020).
[5] Manas Gaur, Ugur Kursuncu, Amanuel Alambo, Amit Sheth, Raminta Daniulaityte, Krishnaprasad Thirunarayan, and Jyotishman Pathak. 2018. "Let Me Tell You About Your Mental Health!" Contextualized Classification of Reddit Posts to DSM-5 for Web-based Intervention. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 753–762.
[6] George Gkotsis, Anika Oellrich, Sumithra Velupillai, Maria Liakata, Tim JP Hubbard, Richard JB Dobson, and Rina Dutta. 2017. Characterisation of mental health conditions in social media using Informed Deep Learning. Scientific Reports 7 (2017), 45141.
[7] Benjamin Harbelot, Helbert Arenas, and Christophe Cruz. 2015. LC3: A spatio-temporal and semantic model for knowledge discovery from geospatial datasets. Journal of Web Semantics 35 (2015), 3–24.
[8] Emily A Holmes, Rory C O'Connor, V Hugh Perry, Irene Tracey, Simon Wessely, Louise Arseneault, Clive Ballard, Helen Christensen, Roxane Cohen Silver, Ian Everall, et al. 2020. Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. The Lancet Psychiatry (2020).
[9] Ugur Kursuncu, Manas Gaur, and Amit Sheth. 2019. Knowledge Infused Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learning. arXiv preprint arXiv:1912.00512 (2019).
[10] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
[11] Meenakshi Nagarajan, Karthik Gomadam, Amit P Sheth, Ajith Ranabahu, Raghava Mutharaju, and Ashutosh Jadhav. 2009. Spatio-temporal-thematic analysis of citizen sensor data: Challenges and experiences. In International Conference on Web Information Systems Engineering. Springer, 539–553.
[12] Jianyin Qiu, Bin Shen, Min Zhao, Zhen Wang, Bin Xie, and Yifeng Xu. 2020. A nationwide survey of psychological distress among Chinese people in the COVID-19 epidemic: implications and policy recommendations. General Psychiatry 33, 2 (2020).
[13] Amit Sheth and Pavan Kapanipathi. 2016. Semantic filtering for social data. IEEE Internet Computing 20, 4 (2016), 74–78.
[14] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural news recommendation with personalized attention. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2576–2584.
[15] Amir Hossein Yazdavar, Hussein S Al-Olimat, Monireh Ebrahimi, Goonmeet Bajaj, Tanvi Banerjee, Krishnaprasad Thirunarayan, Jyotishman Pathak, and Amit Sheth. 2017. Semi-supervised approach to monitoring clinical depressive symptoms in social media. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. 1191–1198.

Cost Aware Feature Elicitation

Srijita Das (The University of Texas at Dallas; Srijita.Das@utdallas.edu)
Rishabh Iyer (The University of Texas at Dallas; Rishabh.Iyer@utdallas.edu)
Sriraam Natarajan (The University of Texas at Dallas; Sriraam.Natarajan@utdallas.edu)

ABSTRACT

Motivated by clinical tasks where acquiring certain features such as FMRI or blood tests can be expensive, we address the problem of test-time elicitation of features. We formulate the problem of cost-aware feature elicitation as an optimization problem with a trade-off between performance and feature acquisition cost. Our experiments on three real-world medical tasks demonstrate the efficacy and effectiveness of our proposed approach in minimizing costs and maximizing performance.

CCS CONCEPTS

• Supervised learning → Budgeted learning; Feature selection; • Applications → Healthcare.

KEYWORDS

cost sensitive learning, supervised learning, classification

ACM Reference Format: Srijita Das, Rishabh Iyer, and Sriraam Natarajan. 2020. Cost Aware Feature Elicitation. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

In a supervised classification setting, every instance has a fixed feature vector, and a discriminative function is learnt on such a fixed-length feature vector and its corresponding class variable. However, a lot of practical problems like healthcare, network domains, designing survey questionnaires [19, 20], etc. have an associated feature acquisition cost. In such domains, there is a cost budget, and getting all the features of an instance can be very costly.

tests for reasonably accurate prediction. We build on the intuition that, given certain observed features like one's demographic details, the most important features for a patient depend on the important features for similar patients. Based on this intuition, we find similar data points in the observed feature space and identify the important feature subsets of these similar instances by employing a greedy information-theoretic feature selector objective.

Our contributions in this work are as follows: (1) formalize the problem as a joint optimization problem of selecting the best feature subset for similar data points and optimizing the loss function using the important feature subsets; (2) account for acquisition cost in both the feature selector objective and the classifier objective to balance the trade-off between acquisition cost and model performance; (3) empirically demonstrate the effectiveness of the proposed approach on three real-world medical data sets.

2 RELATED WORK

The related work on cost-sensitive feature selection and learning can be categorized into the following four broad approaches.

Tree based budgeted learning: Prediction-time elicitation of features under a cost budget has been widely studied in the literature. A lot of work has been done in tree based models [5, 16, 17, 26–28] by adding a cost term to the tree objective function in either decision trees or ensemble methods like gradient boosted trees. All these methods aim to build an adaptive and complex decision tree boundary by considering the trade-off between performance and test-time feature acquisition cost. While we are similar in motivation to these approaches, our methodology is different in the sense that we do not consider tree based models. Instead our approach aims
As a result, to find local feature subsets using an information theoretic feature many cost sensitive classifier models [2, 8, 24] have been proposed selector for different clusters of training instance build in a lower in literature to incorporate the cost of acquisition into the model dimensional space. objective during training and prediction. Adaptive classification and dynamic feature discovery: Our Our problem is motivated by such a cost-aware setting where the work also draws inspiration from Nan al.’s work [15] where they assumption is that prediction time features have an acquisition cost learn a high performance costly model and approximate the model’s and adheres to a strict budget. Consider a patient visiting a doctor performance adaptively by building a low cost model and gating for some potential diagnosis of a disease. For such a patient, infor- function which decides which model to use for specific training in- mation like age, gender, ethnicity and other demographic features stances. This adaptive switching between low and high cost model are easily available at zero cost. However, various lab tests that the takes care of the trade-off between cost and performance. Our patient needs to undergo incurs cost. So, a training model should be method is different from theirs because we do not maintain a high able to identify the most relevant (i.e. those which are most infor- cost model which is costly to build and and difficult to decide. We mative, yet least costly) lab tests that are required for each specific refine the parameters of a single low cost model by incorporating a patient. The intuition of this work is that different patients, depend- cost penalty in the feature selector and model objective. Our work ing on their history, ethnicity, age and gender, may require different is also along the direction of Nan et al.’s work [18] where they select varying feature subsets for test instance using neighbourhood in- In M. Gaur, A. Jaimes, F. 
Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, formation of the training data. While calculating the neighborhood California, USA, August 24, 2020. Use permitted under Creative Commons License information from training data is similar to building clusters in Attribution 4.0 International (CC BY 4.0). our approach, the training neighborhood for our method is on just KiML’20, August 24, 2020, San Diego, California, USA, © 2020 Copyright held by the author(s). the observed feature space. Moreover, we incorporate the neigh- https://doi.org/10.1145/nnnnnnn.nnnnnnn bourhood information in the training algorithm whereas Nan et KiML’20, August 24, 2020, San Diego, California, USA, Srijita Das, Rishabh Iyer, and Sriraam Natarajan al.’s work is a prediction time algorithm. Ma et al. [10] also address observed features to find similar instances in the training set and this problem of dynamic discovery of features based on generative identify the important feature subsets for each of these clusters modelling and Bayesian experimental design. based on a feature selector objective function which balances the Feature elicitation using Reinforcement learning: There is trade-off between choosing the important features and the cost at another line of work along the sequential decision making liter- which these features are acquired. ature [4, 9, 22] to model the test time elicitation of features by learning the optimal policy of test feature acquisition. Along this direction, our work aligns with the work of Shim et al. [25] where 3.2 Proposed solution they jointly train a classifier and RL agent together. Their classifier As a first step, we cluster the training instances based on just the objective function is similar to our method with a cost penalty, observed zero cost feature set O. The intuition is that instances however they use a Deep RL agent to figure out the policy. 
We on with similar features will also have similar characteristics in terms the other hand use localised feature selector to find the important of which elicitable features to order. For example, in a medical appli- feature subsets for the underlying training clusters in the observed cation, whether to request for a blood test or a ct-scan will depend feature space. on factors such as age, gender, ethnicity and whether patients with Active Feature Acquisition: Our problem set-up is also inspired similar demographic features had requested these tests. Also, since by work along active feature acquisition [13, 14, 19, 23, 29] where the feature set O, comes at zero cost, we assume that for unseen certain feature subsets are observed and rest are acquired at a cost. test instances, this feature set is observed. While all the above mentioned work follow this problem set up during training time and typically use active learning to seek infor- mative instances at every iteration, we use this particular setting for test instances. Unlike their work, all the training instances in our work are fully observed and the assumption is that the feature acquisition cost has already being paid during training. Also, we address a supervised classification problem instead of an active learning set up. Our problem set up is similar to Kanani et al. [6] as they also have partial test instances, however their problem is that of instance acquisition where the acquired feature subset is fixed. Figure 1: Optimization framework for the proposed problem Our method aims at discovering variable length feature subsets for various underlying clusters. Our contributions: Although the problem of prediction time fea- ture elicitation has been explored in literature from various direc- tions and with various assumptions, we come up with an intu- itive solution to this problem and formulate the problem in a two We propose a model which consists of a parameterized feature step optimization framework. 
We incorporate acquisition cost selector module 𝐹 (𝑋, E𝑐𝑖 , 𝛼) which takes in a set of input instances in both the feature selector and model objectives to balance the 𝐸𝑐𝑖 belonging to the cluster 𝑐𝑖 based on the feature set O and pro- performance and cost trade-off. The problem set up is naturally duces a subset 𝑋 of most important features for the classification applicable in real world health care and other domains where the task. The feature selection model is based on an information- theo- knowledge of the observed features also needs to be accounted retic objective function and is augmented with the feature cost to while selecting the elicitable features . account for the trade off between model performance and acquisi- tion cost at test-time. The output feature subset from the feature 3 COST AWARE FEATURE ELICITATION selector module are used to update the parameters of the classifier. The optimization framework is shown in Figure 1 3.1 Problem setup Information theoretic Feature selector model: The feature Given: A dataset {(𝑥 1, 𝑦1 ), · · · , (𝑥𝑛 , 𝑦𝑛 )} with each 𝑥𝑖 ∈ R𝑑 as the selector module selects the best subset of features for each cluster feature set. Each feature has an associated cost 𝑟𝑖 . of training data based on an information theoretic objective score. Objective: Learn a discriminative model which is aware of the fea- Since at test time, we do not know the elicitable feature subset E ture costs and can balance the trade-off between feature acquisition (since the goal of feature selection is in the first place to find the cost and model performance. truly necessary features for learning). Hence we propose to use the We make an additional assumption here that there is a subset of fea- closest set of instances in the training data to the current instance. tures which have 0 cost. These could be, for example, demographic Since we assume that the training data has already been elicited, information (e.g. 
age, gender, etc) in a medical domain which are we have all the features observed in the training data. We compute easily available/less cumbersome to obtain as compared to other this distance just based on the observed feature set O. We cluster features. In other words, we can partition the feature set F = O ∪ E the training data based on the observed features into m clusters where O are the zero cost observed features and E are the elicitable 𝑐 1, 𝑐 2, · · · 𝑐𝑚 . Next, we use the Minimum-Redundancy-Maximum features which can be acquired at a cost. We also assume that the Relevance (MRMR) feature Selection paradigm [1, 21]. We denote training data is completely available with all features (i.e. the cost parameters [𝛼𝑐1𝑖 , 𝛼𝑐2𝑖 , 𝛼𝑐3𝑖 , 𝛼𝑐4𝑖 ] as parameters of a particular cluster for all the features has already been paid). The goal is to use these 𝑐𝑖 . The feature selection module is a function of the parameters of KiML’20, August 24, 2020, San Diego, California, USA, Cost Aware Feature Elicitation the cluster to which a set of instances belong and is defined as: where 𝜆1 and 𝜆2 are hyper-parameters. In the above equation, 𝜃 Õ is the parameter of the model and can be updated by standard 𝐹 (𝑋, E𝑐𝑖 , 𝛼𝑐𝑖 ) = 𝛼𝑐1𝑖 𝐼 (E𝑝 ; 𝑌 ) gradient based techniques. This loss function takes into account the E𝑝 ∈𝑋 important feature subset for each cluster and updates the parameter accordingly. The classifier objective also consists of a cost term | {z } max. relevance denoted by 𝑐 (𝑋𝛼𝑖 ) to account for the cost of the selected feature subset. For hard budget on the elicited features, the cost component Õ © Õ Õ − 𝛼𝑐2𝑖 𝐼 (E 𝑗 ; E𝑝 ) − 𝛼𝑐3𝑖 𝐼 (E𝑝 ; E 𝑗 |𝑌 ) ® ª E𝑝 ∈𝑋 « E 𝑗 ∈𝑋 E 𝑗 ∈𝑋 (1) in the model objective can be considered. In case of a cost budget, this component can be ignored because the elicited feature subset ¬ | {z } Õ min. redundancy adheres to a fixed cost and hence, this term is constant. 
− 𝛼𝑐4𝑖 𝑐 (E𝑝 ) E𝑝 ∈𝑋 3.3 Algorithm | {z } We present the algorithm for Cost Aware Feature Elicitation cost penalty (CAFE) in Algorithm 1. CAFE takes as input set of training examples where 𝐼 (E𝑝 ; 𝑌 ) is the mutual information between the random vari- E, the zero cost feature set O, the elicitable feature subset E, a cost able E𝑝 (feature) and 𝑌 (target). In the above equation, the feature vector 𝑀 ∈ R𝑑 and a budget 𝐵. Each element in the training set E subset 𝑋 is grown greedily using a greedy optimization strategy consists of a tuple (𝑥, 𝑦) where 𝑥 ∈ R𝑑 is the feature vector and y maximizing the above objective function. In equation 1, E𝑝 denotes is the label. a single feature from the elicitable set E that is considered for eval- The training instances E are clustered based on just the observed uation based on the subset 𝑋 grown so far. The first term is the feature set O using K-means clustering (Cluster). For every cluster mutual information between each feature and the class variable 𝑌 . 𝑐𝑖 , the training instances belonging to the cluster is assigned to In a discriminative task, this value should be maximized. The sec- the set E𝑐𝑖 and is passed to the Feature Selector module (lines 6-8). ond term is the pairwise mutual information between each feature The FeatureSelector function takes E𝑐𝑖 , parameter 𝛼, the feature to be evaluated and the features already added to the feature subset subsets O and E, cost vector 𝑀 and a predefined budget 𝐵 as input 𝑋 . This value needs to be minimized for selecting informative fea- and returns the most important feature subset X𝑐𝛼𝑖 corresponding tures. The third term is called the conditional redundancy [1] and to a cluster 𝑐𝑖 . A greedy optimization technique is used to grow this term needs to be maximized. The last term adds the penalty the feature subset 𝑋 of every cluster based on the feature selector for cost of every feature and ensures the right trade-off between objective function defined in Equation 1. 
The FeatureSelector cost, relevance and redundancy. For this work, we do not learn the terminates once the budget 𝐵 is exhausted or the mutual informa- parameters 𝛼𝑐𝑖 for each cluster, instead fix these parameters to 1. tion score becomes negative. Once all the important feature subsets We leave the learning of these parameters to future work. are obtained for all the |𝐶 | clusters, the model objective function is In the problem setup, since the 0 cost feature subset is always optimized as mentioned in Equation 3 for all the training instances present, we always consider the observed feature subset O in ad- using the important feature subsets for the clusters to which the dition to the most important feature subset as returned by the training instances belong (lines 12-18). All the remaining features Feature selector objective. We also account for the knowledge of are imputed by using either 0 or any other imputation model be- the observed features while growing the informative feature subset fore training the model. The final training model G(E O∪𝑋𝛼 , 𝛼, 𝜃 ) through greedy optimization. Specifically, while calculating the is an unified model used to make predictions for a test-instance pairwise mutual information between the features and the condi- consisting of just the observed feature subset O. tional redundancy term (second and third term of equation 1), we also evaluate the mutual information of the features with these 4 EMPIRICAL EVALUATION observed features. It is to be noted that in cases where the observed We did experiments with 3 real world medical data sets. The in- features are not discriminative enough of the target, the feature se- tuition of CAFE makes more sense in medical domains, hence our lector module ensures that the elicitable features with maximum choice of data sets. However, the idea can be applied to other do- relevance to the target variable are picked. mains ranging from logistics to resource allocation task. 
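To make the selector concrete, the greedy growth of the feature subset X under the objective in Equation 1 can be sketched as follows. This is a minimal sketch with all α parameters fixed to 1, as in the paper; the discrete mutual-information helpers and the dictionary-based data layout are illustrative assumptions, not the authors' implementation (which builds on the feature selection package of Li et al. [7]):

```python
import math
from collections import Counter

def mutual_info(a, b):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def cond_mutual_info(a, b, y):
    """Conditional mutual information I(a; b | y) for discrete sequences."""
    n = len(y)
    total = 0.0
    for val, cnt in Counter(y).items():
        idx = [i for i in range(n) if y[i] == val]
        if len(idx) > 1:
            total += (cnt / n) * mutual_info([a[i] for i in idx],
                                             [b[i] for i in idx])
    return total

def cafe_select(columns, target, cost, budget):
    """Greedily grow a feature subset X maximizing Equation 1 (all alphas = 1).

    columns: dict feature-name -> list of discrete values (one cluster's rows)
    cost:    dict feature-name -> acquisition cost
    Stops when the budget is exhausted or no candidate has a positive score.
    """
    X, spent = [], 0.0
    while True:
        best, best_score = None, 0.0
        for f in columns:
            if f in X or spent + cost[f] > budget:
                continue
            score = mutual_info(columns[f], target)       # max. relevance
            score -= sum(mutual_info(columns[g], columns[f])
                         for g in X)                      # min. redundancy
            score += sum(cond_mutual_info(columns[f], columns[g], target)
                         for g in X)                      # conditional redundancy
            score -= cost[f]                              # cost penalty
            if score > best_score:
                best, best_score = f, score
        if best is None:
            return X
        X.append(best)
        spent += cost[best]
```

Each candidate's marginal gain mirrors the four terms of Equation 1: relevance to the target, minus pairwise redundancy with the already chosen features, plus conditional redundancy, minus the feature's cost.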
Optimization problem: The cost-aware feature selector F(X, E_ci, α), for a given set of instances E_ci belonging to a specific cluster c_i, solves the following optimization problem:

    X_α^i = argmax_{X ⊆ E} F(X, E_ci, α)    (2)

For a given instance (x, y), we denote L(x, y, X, θ) as the loss function using a subset X of the features as obtained from the feature selector optimization problem. The optimization problem for learning the parameters of a classifier can be posed as:

    min_θ Σ_{i=1}^{n} L(x_i, y_i, X_α^i, θ) + λ1 · c(X_α^i) + λ2 · ||θ||^2    (3)

Algorithm 1 Cost Aware Feature Elicitation
1:  function CAFE(E, O, E, M, B)
2:      E = E_{O∪E}    ⊲ E consists of 0-cost features O and costly features E
3:      C = Cluster(E_O)    ⊲ clustering based on the observed features O
4:      X = {∅}    ⊲ stores the best feature subset of each cluster
5:      for i = 1 to |C| do    ⊲ repeat for every cluster
6:          E_ci = GetClusterMember(E, C, i)
7:              ⊲ get the data points belonging to each cluster c_i
8:          X_ci^α = FeatureSelector(E_ci, α, O, E, M, B)
9:              ⊲ parameterized feature selector for each cluster
10:         X = X ∪ {X_ci^α ∪ O}
11:     end for
12:     for i = 1 to |C| do    ⊲ repeat for every cluster
13:         X_ci^α = GetFeatureSubset(X, i)
14:             ⊲ get the feature subset for each cluster c_i
15:         for j = 1 to |E_ci| do    ⊲ repeat for every data point in cluster c_i
16:             Optimize J(x_j, y_j, X_ci^α, θ, M)
17:                 ⊲ optimize the objective function in Equation 3
18:             Update θ    ⊲ update the model parameter θ
19:         end for
20:     end for
        return G(E_{O∪X_α}, α, θ)    ⊲ G is the training model built on E
21: end function

Table 2 jots down the various features of the data sets used in our experiments. Below are the details of the 3 real data sets we use for our experiments.

1. Parkinson's disease prediction: The Parkinson's Progression Marker Initiative (PPMI) [12] is an observational study whose aim is to identify Parkinson's disease progression from various types of features. The PPMI data set consists of features related to various motor functions and non-motor behavioral and psychological tests. We consider certain motor assessment features like rising from chair, gait, freezing of gait, posture and postural stability as observed features, and all the remaining features as elicitable features which must be acquired at a cost.

2. Alzheimer's disease prediction: The Alzheimer's Disease Neuroimaging Initiative (ADNI¹) is a study that aims to test whether various clinical, fMRI and biomarker features can be used to predict the early onset of Alzheimer's disease. In this data set, we consider the demographics of the patients as observed, zero-cost features, and the fMRI image data and cognitive score data as unobserved, elicitable features.

3. Rare disease prediction: This data set is created from survey questionnaires [11], and the task is to predict whether a person has a rare disease or not. The demographic features are observed, while other sensitive questions in the survey regarding technology use, health and disease-related meta information are considered elicitable.

Evaluation methodology: All the data sets were partitioned into an 80:20 train-test split. Hyper-parameters like the number of clusters on the observed features were picked by 5-fold cross validation on all the data sets. The optimal numbers of clusters picked were 6 for ADNI, 9 for the Rare disease data set and 7 for the PPMI data set. For the results reported in Table 1, we considered a hard budget on the number of elicitable features and set it to half of the total number of features in the respective data set. We use K-means clustering as the underlying clustering algorithm. For all the reported results, we use an underlying Support Vector Machine [3] classifier with a radial basis kernel function. Since all the data sets are highly imbalanced, we consider metrics like recall, F1, AUC-ROC and precision for our reported results. For the feature selector module, we used the existing implementation of Li et al. [7] and built upon it.

We consider two variants of CAFE: (1) CAFE, in which we replace the missing and unimportant features of every cluster with 0 and then update the classifier parameters; and (2) CAFE-I, where we replace the missing and unimportant features by using an imputation model learnt from the already acquired feature values of other data points. A simple imputation model is used, where we replace the missing features with the mode for categorical features and the mean for numeric features.

Baselines: We consider 3 baselines for evaluating CAFE and CAFE-I: (1) using the observed, zero-cost features to update the training model, denoted OBS; (2) using a random subset of a fixed number of elicitable features and all the observed features to update the training model, denoted RANDOM (for this baseline, the results are averaged over 10 runs); and (3) using the information-theoretic feature selector score defined in Equation 1 to select the 'k' best elicitable features on the entire data, without any cluster consideration, along with the observed features, denoted KBEST. We keep the value of 'k' the same as that used by CAFE. Although some of the existing methods could be potential baselines, none of them match the exact setting of our problem, hence we do not compare our method against them.

Figure 2: Recall vs. number of clusters for Rare disease for CAFE-I.

Results: We aim to answer the following questions:
Q1: How do CAFE and CAFE-I with a hard budget on features compare against the standard baselines?
Q2: How do the cost-sensitive versions of CAFE and CAFE-I fare against the cost-sensitive baseline KBEST?

The results reported in Table 1 suggest that both CAFE and CAFE-I significantly outperform the other baselines in almost all the metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and CAFE-I outperform the other baselines in the clinically relevant recall metric, while KBEST performs best for the other metrics. The reason is that in ADNI the elicitable features are image features, and we discretize the image features to calculate the information gain for the feature selector module; the granular-level feature information is lost because of this discretization, hence the drop in performance.
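The simple per-column imputation used by CAFE-I above (mode for categorical features, mean for numeric ones) can be sketched as follows; the None-based encoding of unacquired feature values is an illustrative assumption, not the authors' data format:

```python
from collections import Counter

def impute(rows, numeric):
    """Fill None entries column-wise: mean for numeric columns, mode otherwise.

    rows:    list of equal-length lists, with None marking unacquired features
    numeric: set of column indices treated as numeric
    A sketch of CAFE-I's imputation step; plain CAFE instead replaces None by 0.
    """
    cols = list(zip(*rows))
    filled = []
    for j, col in enumerate(cols):
        known = [v for v in col if v is not None]
        if j in numeric:
            fill = sum(known) / len(known)          # mean of observed values
        else:
            fill = Counter(known).most_common(1)[0][0]   # mode of observed values
        filled.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]
```

For example, `impute([[1, 'a'], [3, None], [None, 'a']], {0})` fills the missing numeric entry with the column mean 2.0 and the missing categorical entry with the column mode 'a'.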
For the experiments in Table 1, 1 www.loni.ucla.edu/ADNI we keep the budget to be approximately half of the total number KiML’20, August 24, 2020, San Diego, California, USA, Cost Aware Feature Elicitation Data set Algorithm Recall F1 AUC-ROC AUC-PR OBS 0.647 0.488 0.642 0.347 RANDOM 0.57 ± 0.064 0.549± 0.059 0.693 ± 0.042 0.421 ± 0.051 Rare disease KBEST 0.47 0.457 0.628 0.349 CAFE 0.647 0.628 0.749 0.489 CAFE-I 0.647 0.647 0.759 0.512 OBS 0.765 0.685 0.741 0.563 RANDOM 0.857 ± 0.023 0.809 ± 0.015 0.85 ± 0.013 0.712 ± 0.020 PPMI KBEST 0.828 0.807 0.846 0.716 CAFE 0.846 0.817 0.855 0.726 CAFE-I 0.855 0.829 0.865 0.743 OBS 0.5 0.44 0.553 0.365 RANDOM 0.711 ± 0,043 0.697 ± 0.082 0.767 ± 0.064 0.592 ± 0.098 ADNI KBEST 0.73 0.745 0.806 0.646 CAFE 0.807 0.711 0.786 0.578 CAFE-I 0.769 0.701 0.776 0.574 Table 1: Comparison of CAFE against other baseline methods on 3 real data sets Dataset # Pos # Neg # Observed # Elicitable 5 CONCLUSION PPMI 554 919 5 31 ADNI 94 287 6 69 In this paper, we pose the prediction time feature elicitation problem Rare Disease 87 232 6 63 as an optimization problem by employing a cluster specific feature Table 2: Data set details of the 3 real data sets used.#Pos is num- selector to choose the best feature subset and then optimizing the ber of positive example, #Neg is number of negative example. # Ob- training loss. We show the effectiveness of our approach in real data served is number of observed features and # Elicitable is the maxi- sets where the problem set up is intuitive. Future work includes mum number of features that can be acquired. learning the parameters of the feature selector module and jointly optimizing the feature selector and model parameters for a more robust framework and adding more constraints to optimization. of features for all the methods. 
On an average, CAFE-I performs ACKNOWLEDGEMENTS better than CAFE across all the data sets because of the underlying SN & SD gratefully acknowledge the support of NSF grant IIS- imputation model which helps in better treatment of the missing 1836565. Any opinions, findings and conclusion or recommenda- values as against replacing all the features by 0. This answers Q1 tions are those of the authors and do not necessarily reflect the affirmatively. view of the US government. In Figure 3, we compare the cost version of CAFE and CAFE-I against KBEST. Cost version takes into account the cost of individ- ual features and accounts for them as penalty in the feature selector REFERENCES [1] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. 2012. Conditional module. Hence, in this version of CAFE, a cost budget is used as likelihood maximisation: a unifying framework for information theoretic feature opposed to hard budget on the number of elicitable features. We gen- selection. JMLR (2012). erate the cost vector by sampling each cost component uniformly [2] Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X Ling. 2004. Test-cost sensitive naive bayes classification. In ICDM. from (0,1). For PPMI and Rare disease, we can see that cost sensitive [3] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine CAFE performs consistently better than KBEST with increasing learning (1995). cost budget. In the PPMI data set, the greedy optimization of the [4] Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. 2011. Datum-wise classification: a sequential approach to sparsity. In ECML feature selector objective on the entire data set lead to elicitation of PKDD. 375–390. just 1 feature, beyond that the information gain was negative, hence [5] Tianshi Gao and Daphne Koller. 2011. Active classification based on value of classifier. In NIPS. the performance of PPMI across various cross budget remains the [6] P. Kanani and P. Melville. 
2008. Prediction-time active feature-value acquisition same. CAFE on the other hand was able to select important feature for cost-effective customer targeting. Workshop on Cost Sensitive Learning at subsets for various clusters based on the observed features related NIPS (2008). [7] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, to gait and postures. For ADNI data set, CAFE performs better than Jiliang Tang, and Huan Liu. 2018. Feature selection: A data perspective. ACM KBEST only in recall. The reason for this is the same as mentioned Computing Surveys (CSUR) (2018). above. This helps in answering Q2 affirmatively. [8] Charles X Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. 2004. Decision trees with minimal costs. In ICML. Lastly, Figure 2 shows the effect of increasing cluster on the [9] D. J. Lizotte, O. Madani, and R. Greiner. 2003. Budgeted learning of Naive-Bayes validation recall for the Rare disease data set. As can be seen, for classifiers (UAI). 378–385. [10] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez- smaller number of clusters, the recall is very low and increases to Lobato, Sebastian Nowozin, and Cheng Zhang. 2019. EDDI: Efficient Dynamic an optimum for 9 clusters. This helps us in understanding the fact Discovery of High-Value Information with Partial VAE. In ICML. that forming clusters based on observed important features helps [11] H. MacLeod, S. Yang, et al. 2016. Identifying rare diseases from behavioural data: a machine learning approach (CHASE). 130–139. CAFE in selecting different feature subsets for different clusters, [12] K. Marek, D. Jennings, et al. 2011. The Parkinson Progression Marker Initiative thus helping the learning procedure. (PPMI). Prog Neurobiol 95, 4 (2011), 629–635. 
KiML’20, August 24, 2020, San Diego, California, USA, Srijita Das, Rishabh Iyer, and Sriraam Natarajan Figure 3: Recall (left), F1 (middle), AUC-PR (right) for (from top to bottom) Rare Disaese, PPMI, and ADNI. The x-axis refers to the cost budget used which leads to the elicitation of different number of features. [13] P. Melville, M. Saar-Tsechansky, et al. 2004. Active feature-value acquisition for (2005), 1226–1238. classifier induction (ICDM). 483–486. [22] Thomas Rückstieß, Christian Osendorfer, and Patrick van der Smagt. 2011. Se- [14] P. Melville, M. Saar-Tsechansky, et al. 2005. An expected utility approach to quential feature selection for classification. In Australasian Joint Conference on active feature-value acquisition (ICDM). 745–748. Artificial Intelligence. Springer, 132–141. [15] Feng Nan and Venkatesh Saligrama. 2017. Adaptive classification for prediction [23] M. Saar-Tsechansky, P. Melville, and F. Provost. 2009. Active feature-value under a budget. In NIPS. acquisition. Manag Sci 55, 4 (2009). [16] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2015. Feature-budgeted [24] Victor S Sheng and Charles X Ling. 2006. Feature value acquisition in testing: a random forest. In ICML. sequential batch test algorithm. In ICML. [17] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2016. Pruning random forests [25] Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018. Joint active feature acquisi- for prediction on a budget. In NIPS. tion and classification with variable-size set encoding. In NIPS. [18] Feng Nan, Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2014. Fast [26] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2015. Efficient learn- margin-based cost-sensitive classification. In ICASSP. ing by directed acyclic graph for resource constrained prediction. In NIPS. [19] Sriraam Natarajan, Srijita Das, Nandini Ramanan, Gautam Kunapuli, and Predrag [27] Zhixiang Xu, Matt Kusner, Kilian Weinberger, and Minmin Chen. 2013. 
A New Delay Differential Equation Model for COVID-19: Retarded Logistic Equation

B Shayak† (Mechanical and Aerospace Engg, Cornell University, Ithaca, New York State, USA; sb2344@cornell.edu)
Mohit M Sharma (Population and Health Sciences, Weill Cornell Medicine, New York City, USA; mos4004@med.cornell.edu)
Manas Gaur (AI Institute, University of South Carolina, USA; mgaur@email.sc.edu)

†Presenting author, corresponding author. ORCID: 0000-0003-2502-2268

ABSTRACT

In this work we give a delay differential equation, the retarded logistic equation, as a mathematical model for the global transmission of COVID-19. This model accounts for asymptomatic carriers, pre-symptomatic or latent transmission, as well as contact tracing and quarantine of suspected cases. We find that the equation admits varied classes of solutions including self-burnout, progression to herd immunity, and multiple states in between. We use the term "partial herd immunity" to refer to these states, where the disease ends at an infection fraction which is not negligible but is significantly lower than the conventional herd immunity threshold. We believe that the spread of COVID-19 in every localized area can be explained by one of our solution classes.

CCS CONCEPTS
• Applied computing – mathematics and statistics

KEYWORDS
Retarded logistic equation, Asymptomatic carriers, Latent transmission, Contact tracing, Reproduction number calculation, Partial herd immunity

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). KiML'20, San Diego, California, USA. © 2020, Copyright held by the author(s).

1 Introduction

Three kinds of models to study COVID-19 are currently in vogue: lumped parameter or compartmental models (ordinary differential equations), agent-based models, and stochastic differential equation models. The first option affords maximum conceptual clarity at the expense of some simplifying assumptions (homogeneous mixing etc.). The second option affords maximum potential versatility at the cost of huge computational complexity and variability in the network structure. The third option combines features of the previous two; whether the features being synergized are the positive or the negative ones depends to a large extent on the modeler.

In this work we use delay differential equations (DDEs) to propose a simple, single-variable, lumped parameter model for the spread of Coronavirus. Jahedi and Yorke [1] make a strong case for simpler models relative to complex and elaborate ones. In the literature, DDEs have been used for modeling COVID-19, for example in Refs. [2]–[4]. These authors however ignore features such as contact tracing, asymptomatic carriers, and latent transmission; our results too have a richer structure.

2 Derivation of the model

We measure time t in days and use as our basic variable y(t), the cumulative number of corona cases in the region of interest, including active cases, recovered cases, and deaths. The following "word-equation" summarizes the approach:

Rate of emergence of new cases = (Interaction rate of each existing case) × (Probability of transmission) × (Number of existing cases) .   (0)

The left hand side (LHS) here is just dy/dt, whereas the right hand side (RHS) needs a detailed derivation.

Equation (0) assumes that the disease is transmitted from infected to susceptible people via interaction, and not via airborne transmission. Due to asymptomatic and pre-symptomatic carriers, there are always cases moving about in society who are oblivious to their infectivity. Each such case interacts with other people at a different rate. For example, a working-from-home professor might venture outside once every three days and interact with one person on each trip, while a grocer might go to work and interact with 10 customers every day. The professor has an interaction rate of 1/3 persons/day while the grocer has an interaction rate of 10 persons/day. For a compartmental model, one must average over the professor, the grocer, and all the other un-quarantined cases to generate an effective per-case interaction rate q0.

Every interaction of course does not result in a transmission; there is a probability strictly less than unity that the virus jumps from the infected person to the person with whom s/he is interacting. This probability has two components. The first component is that the healthy person must be susceptible to begin with. While we ignore intrinsic insusceptibles, there will be people who have recovered from the disease and are therefore not susceptible again. In this Article, we assume that one bout of infection brings permanent immunity. The assumption is valid so long as the immunity period exceeds the total epidemic duration. Till date, there is little credible evidence for re-infection [5]–[7]; contrarily, a very recent and thorough study [8] based on the monitoring of a huge patient cohort has found significant evidence of long-lasting antibodies. If N is the initial number of susceptible people (recall that y is the case count), then the probability that a random person is a recovered case is approximately y/N and the probability that s/he is susceptible is approximately 1−y/N. This expression is approximate because the true number of recovered cases at any time is less than y; the error however is small since the recovery period is much shorter than the overall course of the epidemic. Note that 1−y/N is a logistic term, and represents a herd immunity effect.

Given susceptibility, the next probability is that the virus actually does jump from the un-quarantined case to the susceptible person. This probability depends on the level of precaution, such as face covering or mask, handwashing and disinfection, being adopted by the case as well as by the susceptible person. For a compartmental model, the probability must be averaged over all the un-quarantined cases. If this average probability is P0, then q0(1−y/N)P0 gives the per-case spreading rate. Since q0 and P0 are both dependent on public health measures, and are both difficult to measure independently, we club the two together into a single parameter which we call m0.

So far we have accounted for the rate at which each case spreads the disease; now we have to count the number of cases out of quarantine. Let us start with an asymptomatic carrier, who remains in open society throughout. S/he typically transmits the disease for 7 days, which is called the infection period. Then, new healthy people can only be infected by those asymptomatic cases who have fallen sick within the last 7 days, and not by those who have fallen sick earlier. The number of such people is the number of asymptomatic sick people today minus the number of those 7 days earlier. Mathematically, let μ1 (between 0 and 1) denote the fraction of asymptomatic carriers and τ1 the asymptomatic infection period. Then, the number of asymptomatic transmitters today is μ1(y(t)−y(t−τ1)). Here we can see the emergence of the delay term.

The remaining fraction 1−μ1 of cases are symptomatic. Let τ2 be the latency period during which these cases remain transmissible prior to displaying symptoms. It is assumed that they isolate themselves thereafter. We also assume that the incubation period is equal to the latency period. Finally, the contact tracing drive conducted by the public health department is taken into account. We assume that this drive is instantaneous and proceeds in the forward direction, starting from freshly arriving symptomatic cases. The contact trace captures patients who were exposed to the new case τ2 days ago, as well as patients who were exposed immediately before the new case manifested symptoms. The average duration for which these secondary patients have remained at large is τ2/2, be they symptomatic or asymptomatic. The assumption of instantaneous contact tracing, which decreases the average time that contact-traced cases spend out of quarantine, opposes the error arising from the assumption of a zero non-transmissible incubation period, which increases the average time for which the contact-traced cases transmit before quarantine. These two effects are assumed here to cancel. Let μ3 (between 0 and 1) denote the fraction of all cases who escape from contact tracing drives; the complementary fraction 1−μ3 get caught. Thus, we have three classes of un-quarantined cases: (a) 1−μ3 are contact-traced cases who remain in society for a time τ2/2, (b) μ3(1−μ1) are untraced symptomatic cases who go into isolation only after time τ2, and (c) μ3μ1 are undetected asymptomatic cases who transmit for the entire infection period τ1.

Arguments similar to those of the previous paragraph yield the total number of un-quarantined cases as

n = (1−μ3)[y(t) − y(t−τ2/2)] + (1−μ1)μ3[y(t) − y(t−τ2)] + μ1μ3[y(t) − y(t−τ1)] .   (1)

The preceding arguments now yield the mathematical form of (0) as

dy/dt = m0 (1 − y/N) [ y(t) − (1−μ3) y(t−τ2/2) − (1−μ1)μ3 y(t−τ2) − μ1μ3 y(t−τ1) ] ,   (2)

which is the retarded logistic equation.

3 Solutions of the model

Due to the complexity of equation (2), an analytical solution using perturbation theory etc. has not been attempted in this case. Instead we have used numerical integration to obtain the solutions of (2). Before giving the solutions however, we present the calculation of the reproduction number R. To find R at any state of evolution of the disease, we first treat y in the logistic term as constant, and then carry out the steps described in Ref. [9]. This yields the expression

R = m0 (1 − y/N) [ (1 + μ3 − 2μ1μ3) τ2/2 + μ1μ3 τ1 ] .   (3)

The ease of calculating R relative to the ordinary differential equation based models [10] is noteworthy.

Solution classes of the logistic DDE (2) are now demonstrated. The numerical integration routine used is second order Runge-Kutta with a time step of 1/1000 day. As the testbed for the simulations, we consider a Notional City having N=300000, μ1=0.8 (the maximum value as per our knowledge [11]–[13]), τ1=7 days and τ2=3 days [14]. The initial condition needs to be a function having the length of the maximum delay involved in the problem, which is seven days; we take this function to be zero cases to start with and a constant increase of 100 cases/day for a week.
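The paper does not include code; the following is our minimal sketch of such an integration, assuming Heun's method as the second-order Runge-Kutta variant and linear prehistory as described above. The parameters m0 = 0.23 and μ3 = 1/2 are the hard-lockdown City A values quoted in the next section; all names and the 200-day horizon are our choices, not the authors'.

```python
# Sketch of the numerical scheme: Heun's (second-order Runge-Kutta) method
# with step h = 1/1000 day. All delays are exact multiples of h, so delayed
# values are read directly off the stored grid without interpolation.
# Parameters: Notional City testbed with City A's m0 and mu3 (assumption).

N, mu1, mu3, m0 = 300_000, 0.8, 0.5, 0.23
tau1, tau2 = 7.0, 3.0            # days
h = 1.0 / 1000                   # time step, days
d1 = round(tau2 / 2 / h)         # delay tau2/2 in grid steps
d2 = round(tau2 / h)             # delay tau2
d3 = round(tau1 / h)             # delay tau1

# Initial history: zero cases, then a constant increase of 100 cases/day
# for one week, i.e. y = 0 at t = -7 and y = 700 at t = 0.
hist = round(7.0 / h)
y = [100.0 * (i * h) for i in range(hist + 1)]    # t in [-7, 0]

def rhs(yy, i):
    """Right-hand side of the retarded logistic equation (2) at grid index i."""
    n = (yy[i] - (1 - mu3) * yy[i - d1]
         - (1 - mu1) * mu3 * yy[i - d2]
         - mu1 * mu3 * yy[i - d3])
    return m0 * (1 - yy[i] / N) * n

steps = round(200.0 / h)         # integrate for 200 days
for i in range(hist, hist + steps):
    k1 = rhs(y, i)
    y.append(y[i] + h * k1)      # Euler predictor
    k2 = rhs(y, i + 1)           # slope at the predicted point
    y[i + 1] = y[i] + h * (k1 + k2) / 2

print(y[-1])                     # cumulative cases after 200 days
```

With these sub-threshold parameters (R0 < 1, as the next section reports) the cumulative count settles at a small fraction of N, i.e. the self-burnout class of solutions.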
Notional City A has m0=0.23 and μ3=1/2, which describes a hard lockdown [15] accompanied by good contact tracing. R0 (i.e. (3) evaluated at y=0) is 0.886. The epidemic ends with a negligible fraction of infected people, as shown below. This and the next five plots are three-way: each plot shows y as a blue line, its derivative as a green line, and the weekly increments in cases, or epidemiological curve, as a grey bar chart. These last have been reduced by a factor of 7 to ensure clarity of presentation. We report the rates on the left hand side y-axis and the cumulative cases on the right hand side y-axis.

Figure 1: City A extinguishes the epidemic in time.

This is exactly what has happened in New Zealand – that il fortunatissimo per verita ["most fortunate one, in truth"] has indeed quashed the epidemic completely, with the final case count being a negligible fraction of its total (tiny and sparsely distributed) population.

The parameter values for Notional City B are the same as those for A except that μ3=0.75; a greater fraction of cases escape the contact tracing drive. R0 is 1.16, and R becomes 1 at y=40500 cases.

Figure 2: City B grows at first before reaching burnout. The symbol 'k' denotes thousand.

The outbreak enters the exponential regime right after being released. As y increases, R gradually reduces, so the growth slows down until it peaks when the case count is about 39,000 [compare with the value of 40,500 when R=1 as per (3)]. Thereafter, the disease progresses to extinction in time. The overall progression is very long, but one hopes that the relatively small size of the peak can prevent overstressing of medical care facilities and thus avoid unnecessary deaths. Delhi and Mumbai in India and Los Angeles in the USA are in all probability cities of this type, since the disease there spiraled out of control despite hard lockdowns being imposed at an early stage.

City B also enables us to explain partial herd immunity. Even though the initial conditions were unfavourable for containment of the epidemic, herd immunity started activating as the disease proliferated. A stable zone (R<1) was entered when only 13.5 percent of the total susceptible population was infected, and a similar percentage again got infected before the epidemic ended. Thus, herd immunity worked in synergy with non-pharmaceutical interventions to stop the epidemic at only a 26 percent infection level, which is significantly less than the conventional 70-90 percent threshold [16]. This is what we call partial herd immunity. Our findings are in agreement with, and act as an explanation for, what has been obtained by Britton et al. [17] and Peterson et al. [18].

We now consider Notional City C, which differs from City B in that m0=0.5; lockdown is replaced by a much more permissive state. R0 is above 2.5; 180,000 infections are required to bring it below unity. Need one mention that this is a public health disaster?

Figure 3: City C goes to herd immunity – total, not partial. The symbol 'k' denotes thousand and 'L' hundred thousand.

Notional City D combines features of B and C. This city begins with m0=0.5 like City C but reduces to m0=0.23 like City B when the case count reaches 40,000 (the R=1 threshold for B's parameters).

Figure 4: As the input, so the output – D's response combines features of B and C. The symbol 'k' denotes thousand and 'L' hundred thousand.

We can see a case count as well as a total duration intermediate between B and C; the epidemic is over in 70 days, but the peak rate of 12,920 cases/day is still very high and likely to load hospital facilities beyond their carrying capacity.

Cities E and F demonstrate the issues faced in reopening. In both these cities, the parameters and case trajectory are identical to those of City A for the first 80 days. Then, E and F reopen on the 80th day by increasing m0 from 0.23 to 0.5, and simultaneously decreasing μ3, i.e. deploying a more effective contact tracing program which had been built up during the lockdown. The post-reopening μ3's for E and F are 0.1 and 0.2 respectively.

Figure 5: City E, like City A, is a success story.

Figure 6: Unlike City E, F is a failure story. The symbol 'k' denotes thousand and 'L' hundred thousand.

The difference between Cities E and F is dramatic. Mathematically, R remained less than unity throughout in E; its value after reopening was 0.985. We can see that the case rate decreases monotonically all the time. In F, the post-reopening R became 1.22 and sent the trajectory haywire. In practice however, the incipient increase in the case rate after the 80th day acts as an advance warning of what has happened – the reopening steps should be reversed if it is at all possible to do so while satisfying economic and other external constraints.
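The R values quoted for the Notional Cities can be checked directly from equation (3). The following sketch is ours (function name and structure are our choices); the parameter values are those stated in the text.

```python
# Check of the reproduction number values quoted in Section 3, using
# equation (3):
#   R = m0 * (1 - y/N) * ((1 + mu3 - 2*mu1*mu3) * tau2/2 + mu1*mu3*tau1)
# with the Notional City testbed N = 300000, mu1 = 0.8, tau1 = 7, tau2 = 3.

N, mu1, tau1, tau2 = 300_000, 0.8, 7.0, 3.0

def R(m0, mu3, y=0.0):
    """Reproduction number from equation (3)."""
    return m0 * (1 - y / N) * ((1 + mu3 - 2 * mu1 * mu3) * tau2 / 2
                               + mu1 * mu3 * tau1)

R0_A = R(m0=0.23, mu3=0.5)    # City A: hard lockdown, good tracing -> ~0.886
R0_B = R(m0=0.23, mu3=0.75)   # City B: poorer contact tracing      -> ~1.16
R0_C = R(m0=0.5,  mu3=0.75)   # City C: permissive state            -> above 2.5
R_E  = R(m0=0.5,  mu3=0.1)    # City E after reopening              -> ~0.985
R_F  = R(m0=0.5,  mu3=0.2)    # City F after reopening              -> ~1.22

# Case count at which City B reaches R = 1 (~40,500, i.e. ~13.5% of N):
y_star = N * (1 - 1 / R0_B)
print(R0_A, R0_B, R0_C, R_E, R_F, y_star)
```

These few lines reproduce every R value quoted in the text, including the post-reopening values for Cities E and F and the R=1 threshold for City B.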
Conclusion

In this Article we have presented a new mathematical model for COVID-19 which is simple and elegant in structure but can generate a variety of realistic solution classes. We hope that our work may be of use to mathematicians and data scientists who are trying to understand the spread of the disease in a quantitative manner. The public health implications of these results are reserved for another study.

REFERENCES
[1] S. Jahedi and J. A. Yorke, "When the best pandemic models are the simplest," medRxiv, pp. 1–22, 2020, doi: 10.1101/2020.06.23.20132522.
[2] L. Dell'Anna, "Solvable delay model for epidemic spreading: the case of Covid-19 in Italy," 2020. Available: http://arxiv.org/abs/2003.13571.
[3] A. K. Gupta, N. Sharma, and A. K. Verma, "Spatial Network based model forecasting transmission and control of COVID-19," medRxiv, p. 2020.05.06.20092858, 2020, doi: 10.1101/2020.05.06.20092858.
[4] J. Mendenez, "Elementary time-delay dynamics of COVID-19 disease," medRxiv, pp. 1–4, 2020, doi: 10.1101/2020.03.27.20045328.
[5] D. C. Ackerly, "Getting COVID-19 twice," Vox. Available: https://www.vox.com/2020/7/12/21321653/getting-covid-19-twice-reinfection-antibody-herd-immunity.
[6] S. McCamon, "13 USS Roosevelt Sailors Test Positive For COVID-19, Again."
[7] Y. Saplakoglu, "Coronavirus reinfections were false positives," Live Science. Available: https://www.livescience.com/coronavirus-reinfections-were-false-positives.html.
[8] A. Wajnberg et al., "SARS-CoV-2 infection induces robust, neutralizing antibody responses that are stable for at least three months," medRxiv, 2020, doi: 10.1101/2020.07.14.20151126.
[9] B. Shayak and R. H. Rand, "Self-burnout – A New Path to the End of COVID-19," medRxiv, pp. 1–14, 2020, doi: 10.1101/2020.04.17.20069443.
[10] O. Diekmann, J. A. P. Heesterbeek, and M. G. Roberts, "The construction of next-generation matrices for compartmental epidemic models," J. R. Soc. Interface, vol. 7, no. 47, pp. 873–885, 2010, doi: 10.1098/rsif.2009.0386.
[11] "71 percent of patients in Maharashtra are asymptomatic," Mumbai Mirror. Available: https://mumbaimirror.indiatimes.com/coronavirus/news/covid-19-71-of-patients-in-maharashtra-are-asymptomatic-mumbai-cases-at-16579/articleshow/75754328.cms.
[12] "Taking over hospital beds, conducting survey," New Indian Express. Available: https://www.newindianexpress.com/nation/2020/may/30/taking-over-hospital-beds-conducting-survey-uddhav-government-goes-after-covid-19-as-state-tally-c-2149989.html.
[13] "Delhi CM says COVID-19 deaths very less," Times of India. Available: https://timesofindia.indiatimes.com/city/delhi/delhi-cm-says-covid-19-deaths-very-less-but-75pc-cases-asymptomatic-or-showing-mild-symptoms/articleshow/75658636.cms.
[14] M. L. Childs et al., "The impact of long-term non-pharmaceutical interventions on COVID-19 epidemic dynamics and control," medRxiv, vol. 22, p. 2020.05.03.20089078, 2020, doi: 10.1101/2020.05.03.20089078.
[15] B. Shayak and M. M. Sharma, "Retarded Logistic Equation as a Universal Dynamic Model for the Spread of COVID-19," medRxiv, pp. 1–27, 2020, doi: 10.1101/2020.06.09.20126573.
[16] G. A. D'Souza and D. Dowdy, "What is herd immunity and how can we achieve it with COVID-19?" Available: https://www.jhsph.edu/covid-19/articles/achieving-herd-immunity-with-covid19.html.
[17] T. Britton, F. Ball, and P. Trapman, "The disease-induced herd immunity level for Covid-19 is substantially lower than the classical herd immunity level," pp. 1–15, 2020. Available: http://arxiv.org/abs/2005.03085.
[18] A. A. Peterson, C. F. Goldsmith, C. Rose, A. J. Medford, and T. Vegge, "Should the rate term in the basic epidemiology models be second-order?," 2020. Available: http://arxiv.org/abs/2005.04704.

Public Health Implications of a Delay Differential Equation Model for COVID-19

Mohit M Sharma (Population and Health Sciences, Weill Cornell Medicine, New York City, USA; mos4004@med.cornell.edu)
B Shayak (Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, New York State, USA; sb2344@cornell.edu)

ABSTRACT

This paper describes the strategies derived from a novel delay differential equation model [1], signifying a practical extension of our recent work. COVID-19 is an extremely ferocious and unpredictable pandemic which poses unique challenges for public health authorities, on account of which "case races" among various countries and states do not serve any purpose and present delusive appearances while ignoring significant determinants. We aim to propose comprehensive planning guidelines as a direct implication of our model. Our first consideration is reopening, followed by effective contact tracing and ensuring public compliance. We then discuss the implications of the mathematical results on people's behavior, and eventually provide conclusive points aimed at strengthening the arsenal of resources that are helpful in framing public health policies. Knowledge about the pandemic and its association with public health interventions is documented in various literature-based sources. In this study, we explore those resources to explain the findings inferred from our delay differential equation model of COVID-19.

KEYWORDS
Delay differential equation, Contact tracing, Socio-behavioral theories, Lockdown, Reopening

1 INTRODUCTION

The national (USA) and global spread of Coronavirus Disease 2019 (COVID-19), following its origins in Wuhan, China in at least December 2019 and possibly earlier still [2], has been alarmingly rapid and deadly. From the 25 individual national forecasts received by the CDC, it is predicted that the total reported COVID-19 deaths will be between 160,000 and 175,000 by August 15th, 2020 [3]. Some features however, both nationally and globally, have proved counterintuitive. For example, a 76-day lockdown resulted in the outbreak's containment in Wuhan. A similar measure has produced similar results in New Zealand. However, lockdown appeared only marginally effective in New York State, USA, where the case and death counts decreased only after reaching horrifying peak levels [4]. It was contended that the stay at home order in New York came too late. This apparent delay was not present in California, USA; the case counts there went up all the same, and the rate is high even today. We would like to mention that such spatiotemporal anomalies are present not just in the US but also in other countries such as Canada, Russia and India [5], which witnessed high case growth despite being in lockdown. In order to better understand the epidemiology of the transmission of COVID-19, we have constructed a delay differential equation model. Here we present its practical implications, which try to encapsulate a myriad of factors associated with the current scenario.

2 MATHEMATICAL MODELING TO UNDERSTAND THE EPIDEMIOLOGY

For many decades, mathematical modelling has been used as an integral tool for recognizing the trend of disease progression during pandemics. For example, using a simple model explaining the transmission dynamics of infectious disease among the susceptible, infected and recovered populations (the SIR epidemic model), Kermack and McKendrick proposed and later established a principle: the level of susceptibility in the population should be adequately high in order for an epidemic to unfold in that population. Such mathematical models can give valuable insights in explaining the epidemiological status of the population, and can predict or calculate the transmissibility of the pathogen and the potential impact of public health preventive practices [6]. However, a significant body of evidence suggests that decisions should be made regarding the parameters to be included, contingent on their impact on the precision of predictions. Several policy questions about the containment of this outbreak have been considered in our recently proposed simple non-linear model [1]. This paper delves into the practical solutions that can be devised utilizing the directions of our model's outcome.
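Kermack and McKendrick's threshold principle can be illustrated in a few lines. The sketch below is ours, not from either paper: the parameter values (β = 0.3/day, γ = 0.1/day, N = 10^6) are invented for illustration, and forward Euler is used purely for brevity.

```python
# Minimal SIR (Kermack-McKendrick) sketch. Illustrative only: beta, gamma
# and N are invented values, not taken from the papers above.

def simulate_sir(N=1_000_000, I0=10, beta=0.3, gamma=0.1, days=300, dt=0.01):
    """Forward-Euler integration of
       dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I.
       Returns the final (S, I, R)."""
    S, I, R = N - I0, float(I0), 0.0
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt   # new infections this step
        new_rec = gamma * I * dt          # new recoveries this step
        S -= new_inf
        I += new_inf - new_rec
        R += new_rec
    return S, I, R

# With R0 = beta/gamma = 3 > 1 the epidemic takes off and infects most of
# the population; with beta = 0.05 (R0 = 0.5 < 1) it cannot unfold at all.
S, I, R = simulate_sir()
S2, I2, R2 = simulate_sir(beta=0.05)
print(R, R2)
```

The contrast between the two runs is exactly the susceptibility-threshold principle described above: the same seed of 10 cases either sweeps the population or fizzles out, depending only on whether β/γ exceeds unity.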
To generate interpretable results from the epidemiological model, we have used the examples of six types of cities [1]:

1) City A – Moderately effective contact tracing in a hard lockdown. This city has R (reproduction number) < 1 and drives the epidemic to extinction in time.
2) City B – Less effective contact tracing in a hard lockdown. It starts off with R > 1, but reaches R = 1 at a 15% infection level. The epidemic ends at a 30% infection level and takes a very long time to get there.
3) City C – Less effective contact tracing (like City B) with milder restrictions on mobility. It proceeds rapidly to herd immunity.
4) City D – Combination of City B and City C. It starts with mild restrictions on mobility and then progresses to harder restrictions. The duration of the epidemic as well as the final case count are intermediate between those of Cities B and C.
5) City E – Starting off like City A, it reopens with very effective contact tracing and drives the epidemic to extinction in time.
6) City F – Starting off like City A, it reopens with less effective contact tracing and suffers a second wave.

Pragmatic implications of our work are as follows.

3 REOPENING CONSIDERATIONS, ROLE OF TESTING

The unemployment situation generated as a result of lockdowns is currently forcing countries and states to partially reopen their economies even though many of them have not yet got the virus under control. The reopening is easiest in City A regions where cases have slowed down to a trickle. With every new case being detected, swift isolation of all potential secondary, tertiary and maybe even quaternary cases, both forward and backward, should prove possible while the rest of the economy functions in a relatively uninhibited way. Even one mass transmission event can restart an exponential growth regime and force a rollback to a fully locked down state.

Reopening beyond a skeletal level is impossible in City B regions which are still in the ascending phase. The ascent implies that contact tracing is already inadequate, and if on top of that mobility increases, then the region might turn into City C, overstress healthcare systems, and become a massacre. An ascending B-City has little option other than to contact trace as hard as possible and wait for partial herd immunity to kick in. Only when that happens and the cases slow down on their own can it consider a more extensive reopening like a City A region.

Testing is no doubt an important part of the epidemic management process, since it enables the authorities to get an accurate description of the spread of the disease. As we have already discussed, limited testing capacity is giving us a partial or distorted picture in many regions. There is a widespread media perception that extensive testing is one of the prerequisites for any kind of reopening process [7], [8]. Much criticism has also been levelled at certain countries for having inadequate testing programs (we shall elaborate on the blame aspects later). However, we would like to emphasize that testing is as yet a diagnostic tool and not a preventive one. Currently, it can show us how the disease is behaving but cannot slow its spread in any way. Test-induced slowing can come only when the capacity expands to such a level as to be able to preventively test potential super-spreaders such as grocers and food workers every single day. We hope that such a development may prove possible in the near future – many universities, for example, are making reopening arrangements with provision for very frequent testing of the entire community.

During reopening it is vital to get a true picture of the disease evolution so that we can gauge the effect of any relaxation of restrictions – whether it keeps the outbreak under control as in City E or brings about the beginnings of a second wave as in City F. Such beginnings are heralded by a rise in the case rate. As we saw, there was no such rise in City E even though R increased after the reopening. If the rise takes place, the relaxation must immediately be rolled back to avert disaster. Hence, during reopening, the testing capacity must be high enough to detect such incipient rises. As per China's state media reports, with the aim of reopening the economy, the city of Wuhan conducted 6 million tests in one week; we present this fact without discussion or comment.

A second reason why testing is still not all that it could have been is the high false-negative rate during the initial stages of infection [9]. Suppose a contact tracing drive identifies Mr X as a potential case, having been exposed to a known case yesterday. Then, it can be that Mr X contracts the virus ten days from now, in which situation he will report negative if tested today or tomorrow, but will still amount to a spreading risk ten days later if he is at large then. This also means that secondary contact tracing, i.e. finding Mr X's contacts, must go ahead irrespective of his test results. Indeed, the medical authorities are well aware of this loophole.

The US Chamber of Commerce has given out state-by-state reopening guides for small businesses which are mandated to be followed across the US. Continued following of federal, state, tribal, territorial and local recommendations is of paramount importance. Prior to resuming work, all workplaces should have a carefully chartered exposure control, mitigation and recovery plan. Although essential guidance is specific to each business, there are certain measures that can be generally adopted across all workplaces.

1) Reopening in phases – The US government has laid down guidelines to open the country in 3 phases. The first phase involves the continuation of vulnerable individuals remaining at home. When in public, people are expected to wear masks, maintain maximum physical separation, avoid places with more than 10 people and limit non-essential travel. The second phase allows gatherings of 50 people, some non-essential travel and the reopening of schools. The third phase involves relaxation of restrictions, permitting vulnerable populations to operate.

2) Defining new metrics – The post-corona world will witness some significant changes in regulatory controls, and behavioral drift in personal and professional spheres. Cleanliness standards, safety standards, and infection prevention practices with regular monitoring and inspection for their assurance are some of the new terms that will have to be a part of the daily life of the people for at least the next few months.

3) Organizational changes – To help essential operations to function, companies and organizations will have to be prepared with advanced IT systems (in case of continuation of remote working), supply of PPE, travel facilities set up to avoid public transport, and provision of behavioral health services, leaving no stone unturned in overcoming biological, physical, and emotional challenges.

We can see that the above guidelines are broadly conformal to our model predictions.

4 METHODS OF CONTACT TRACING

As we have already mentioned, contact tracing is probably the single most important factor in determining the progression of COVID-19 in a region. We can see from the model that the faster the contact tracing takes place, the better; the more delay we have, the higher R becomes. Moreover, our model does not account for backward contact tracing. In practice however, a sufficiently high level of detection might not be possible to achieve with forward contact tracing alone. As important as it is, contact tracing is also one of the trickiest aspects to handle since it can interfere with people's privacy. In classical contact tracing, human tracers talk to the confirmed cases and track down their movements as well as the persons they interacted with over the past couple of days. This method has worked well in Ithaca, USA and in Kerala, India. While it is the least invasive of privacy, it is also the most unreliable, since people might not remember their movements or their interactions correctly. The time taken by this method is also the maximum. A more sophisticated variant supplements human testimony with CCTV footage and credit/debit card transaction histories – this approach is possible only in countries such as the USA where card usage predominates over cash. The most sophisticated contact tracing algorithms use artificial intelligence together with location-tracking mobile devices and apps – while they are quick and fool-proof, they automatically raise issues of privacy and security. For example, the TraceTogether app in Singapore, which worked very well during the initial phases of the outbreak, has not found popularity with many users [10]. Similarly, India's Aarogya Setu has also raised privacy concerns [11]. Americans too have expressed their aversion to using contact tracing apps in a recent poll, with only 43 percent of people saying that they trusted companies like Google or Apple with their data.

5 ENSURING SOCIAL COMPLIANCE – A BEHAVIORAL PERSPECTIVE

As the epidemic drags on and on, the continued restrictions on social activity are becoming more and more unbearable. There is an increasing tendency, especially among younger people who are much less at risk of serious symptoms, to violate the restrictions and spread the disease through irresponsible actions. However, as City F shows, a rise in violator behavior can completely nullify the effects of lockdown over the past few weeks or months. Here we discuss how public health professionals and policy makers can resort to behavioral/psychological theories to ensure compliance among the common people. The most widely used model is the Health Belief Model, which has been applied successfully to public health challenges. We briefly discuss the utility of this model in the current situation.

The Health Belief Model is a theoretical model which hypothesizes that interventions will be most effective if they target key factors that influence health behaviors, such as perceived susceptibility, perceived severity, perceived benefits, perceived barriers to action, exposure to factors that prompt action, and self-efficacy. In general, this model can be used to design short and long term interventions. The prime components of this model which are relevant in the current scenario can be outlined as follows.

1) Conducting a health need assessment to determine the target population – The best example is the demarcation of zones in India depending on the level of risk. A red zone is highest risk, an orange zone is average risk, and a green zone translates into no cases in the last 21 days.

2) Communicating the consequences involved with risky behaviors in a transparent manner – Central and state ministers as well as public health authorities are in constant communication with the masses.

3) Conveying information about the steps involved in performing the recommended action and focusing on the benefits of action – Famous celebrities, in addition to state and central governments, spread the messages explaining the required steps cogently and ensuring that they have the maximum reach, especially among social media-addicted millennials and similar populations.

4) Being open about the issues/barriers, identifying them at an early stage and working toward resolution – Activating all sorts of helpline numbers, email addresses, personal offices etc. to address any grievances around the topic.

5) Developing skills and providing assistance that encourage self-efficacy and the possibility of positive behavior change – Adequate arrangements for people from lower socio-economic strata, stable and trustworthy financial schemes for the middle class, plans to support small business, and a means to become a bridge between the affluent class and the needy class are some of the ways to foster positive behavior change and develop natural trust.

Other than the Health Belief Model, some theories that can be useful are:

Theory of Reasoned Action – This theory implies that an individual's behavior is based on the outcomes which the individual expects as a result of such behavior. In a practical scenario, if the health officials want the people to follow a particular trend, let us say based on our model, they need to reinforce the advantages of the targeted behavior and strategically address the barriers. For instance, to enforce separation minima even when they are apparently proving ineffective and the cases are increasing, they can use the examples of Cities B and C to convince the citizens that violations – and hence violators – can be responsible for thousands of excess deaths.

Trans-theoretical Model – This model posits that any health behavior change entails progress through six stages of change: precontemplation, contemplation, preparation, action, maintenance and termination. For instance, it was observed in March that, despite a rise in cases in New York City (NYC), people were not observing social restrictions the way they should have. Now, we can see that with passing time, the behavior of the masses transforms according to the stages of this model.

Precontemplation – This is the stage where people are typically not cognizant of the fact that their behavior is troublesome and may cause undesirable consequences. There is a long way to go before an actual behavior change. This phase coincides with the commencement of cases in NYC.

Contemplation – Recognition of the behavior as problematic
Classification is multifactorial, taking begins to surface and a shift begins towards behavior change. into account the incidence of cases, the doubling rate and the When the cases started being reported all over media and the limit of testing and surveillance feedback to classify the districts. major cause of spread began to surface, citizens started paying attention to their activities. KDD KiML 2020 Sharma et al. Preparation – People start taking small steps toward on their course of action. Since the virus is a new one, there is no behavior change like in our case, exhibiting hygienic practices precedent which can act as a model. Even among emerging and ensuring six feet separation minima. infectious diseases, this latest one is particularly unpredictable, since minuscule changes in parameters can cause dramatic Action – This stage covers the phase where people have just changes in the system’s behavior. This phenomenon is best changed their behavior and have positive intention to maintain illustrated by the notional cities, discussed previously. For that approach. In this instance, people continue to practice social example, to get from City A to B, all we did was increase by 50 restrictions and hygiene positively. percent the fraction of people who escaped the contact-tracers’ net. The result was a 30 times (not 30 percent!) increase in the Maintenance – This stage focuses on maintenance and total number of cases. Similarly, the difference between Cities B continuity toward the adopted approach. Majority of people in and D is an 11-day delay (recall that the first seven days in the NYC are exhibiting positive behavior and maintaining it plots are the seeding period, so they don’t count) in imposing the throughout the stages of reopening phases. This is vitally lockdown in D. 11 days out of a 200-plus-day run might not important to ensure that NYC stops at partial herd immunity like sound like a lot. 
But, that was enough to create tens of thousands City D instead of blowing up again like City C. of additional cases, risk overstressing healthcare systems and at Termination – There is lack of motivation to come back to the same time shorten the epidemic duration by a factor of three. the unhealthy behaviors and some sections of people across the Further uncertainty comes from the fact that the parameter country/world will continue practicing good hygiene (though not values are changing constantly. It is a well- known fact the social restrictions!) in our day-to-day lives. reported fraction of asymptomatic carriers has increased Social Ecological theory – This theory highlights multiple continuously over the last three months or so. Considering the levels of influences that molds the decision. In our case, let us sensitivity of this or any other model to parameter values, such say for example that the decision is to maintain sufficient changes can completely invalidate the results of a model as well physical separation once offices are opened up. To successfully as any decision which was made on their basis. Identifying follow this, there is a complex interplay between individual, potential exposures is much easier in a smaller city than a large relationship, community and societal factors that comes into or densely populated one. It is also more effective if the cases are action. Law enforcement authorities need to take this into mostly from the sophisticated social class who can use mobile consideration. A group of individuals when motivated by one phone contact tracing apps or otherwise keep (at least mental) another to follow the guidelines, builds a good connection within records of their movements and of the people they interacted the society, and in turn there is a high probability to build a with. However, if there is an outbreak among the unsophisticated healthy network within a defined area. 
A negative interplay at class, then even the most skillful contact tracer might run up different levels of motivation may in turn, prove disastrous and against a wall of zero or false information. In such cases there are cause all efforts go down the drain. A perfect illustration of this limited options that are left to the authorities to proceed in a in the present condition is how various NGO’s are working in conducive manner. conjunction with public health authorities to bring about a change at an individual level by door-to-door campaigning. This propels the behavior of even the most potentially recalcitrant population India went into lockdown on 25 March 2020. At that time, the in the most desirable way i.e. wearing masks and gloves, official figures stated that there were only 571 cases, which made adopting hand hygiene, being cognizant of symptoms arising in the decision appear premature to many people. Indeed, a seven- any member of the family and following quarantine rules in case day delay of lockdown was suggested so that the migrant workers of travel from other states. would have been able to return to their homes. However, when the lockdown was imposed, the testing had also been woefully 6 SOCIAL ATTITUDES AND BEHAVIOUR inadequate, with a nationwide total of just 22,694 tests having In this Section we address another important issue related to been conducted up to that date. If we use the extrapolation the Coronavirus. This is that the widely heterogeneous case technique of inferring case counts from death counts, then using profiles in different regions have often led to “corona contests” the same 1 percent mortality rate and 20 day interval to death, we among these regions. Far too often, the residents of better-off find almost 40,000 assumed cases on the day that the lockdown regions are seen heaping scorn on worse-hit regions. We have began. 
If we go by this figure, then the lockdown wasn’t really selected a tiny handful of representative media articles, early, and possibly should have been enforced earlier still in castigating the approaches of India, USA and Sweden, to show trouble zones such as Mumbai. Certainly, if the figure of 40,000 the breadth and vitriol of such commentary [12][13][14] cases is true, then one further week of normal life (with huge [15][16].A feature common to almost all opinion pieces like this crowds in trains and railway stations) might have been is that their authors do not have the slightest knowledge of the disastrous. From the vantage point of today, alternate issues involved, either epidemiological or economic. arrangements should definitely have been made much earlier for rehabilitation of the migrant workers. However these Before embarking on criticisms, we should note that policy arrangements would have involved considerable complexity in decisions need to be taken in real time, as the situation evolves. the prevailing situation, and were certainly not as easy as one The authorities do NOT have the benefit of hindsight to decide KDD KiML 2020 Sharma et al. week’s delay in announcing lockdown. Sweden, which has • Efficiency of contact tracing comes at the expense of adopted a controlled herd immunity strategy, has been accused people’s privacy – balancing between the two is a delicate of playing with fire. It is also possible that the Swedish optimization problem. authorities are aware that they do not have the contact tracing capacity required for performing like City A and hence are • In some regions, restrictions such as masks and six-feet attempting something like City D – a faster end of the epidemic separation minima must be maintained for a very long time to than City B at the expense of a higher case count. To make a come. 
The public health authorities can ensure compliance by comprehensive analysis of their policy, it is crucial to know not resorting to socio –behavioral theories/approaches. only the last intricate detail of the epidemiological aspects but In deploying advanced contact tracing techniques, also the details of the economic considerations. That is almost significant consideration has to be given for ensuring high impossible. On a different note however, we have seen reports data security and lay down privacy regulations that are [17], [18] stating that the virus has entered into old age homes convincing to the users and similar establishments, causing hundreds of deaths over there. Assuming that these reports are not overturned in the Control the spread by swift identification and course of time, allowing the ingress of virus into high-risk areas isolation of cases accompanied by tracing and quarantine for is an indefensible action, whatever the overall epidemiological at least 2 weeks strategy. Empowering of individuals and communities by the government to facilitate efficient capacity building. Finally, extremely important public health factors such as the racial dependence of susceptibility and/or transmissibility have just Multidisciplinary coordination, strong leadership to started coming to the surface. Another complete grey area is the mobilize communities and take quick decisions coupled with mutations which this new and vicious virus are undergoing and what thoughtful development of operation plans are likely to prove effect they might have on the spreading dynamics. Some reports also considerably efficient in handling this pandemic to the best of reflect that the change in genetic composition due to mutation might our capacity. be the reason behind huge differences in the crude infection rate between countries [19][20]. In the absence of a clear picture about References this, any public health measure is all the more likely to be a random [1] B. Shayak and M. 
M. Sharma, “Retarded guess with non-zero probabilities of both success and failure. Not logistic equation as a universal dynamic model for the everything about corona is random or outside one’s control though. spread of COVID-19,” medRxiv, p. Amongst the European countries, we can see that Germany, Austria, 2020.06.09.20126573, 2020, doi: Switzerland, Denmark, Norway and Finland have definitely 10.1101/2020.06.09.20126573. managed the epidemic while their neighbors have not, which rules out some hidden luck factor. The same has happened in Kerala and [2] E. Okanyene, B. Rader, Y. L. Barnoon, L. Karnataka (also in India). This has been feasible only due to Goodwin, and J. S. Brownstein, “Analysis of hospital governmental awareness and hard work, and people’s cooperation. traffic and search engine data in Wuhan China indicates early disease activity in the Fall of 2019,” Similarly, there are some governments which have been clearly Harvard, 2020, [Online]. Available: guilty of negligence or hubris in their management of the disease. It http://nrs.harvard.edu/urn-3:HUL.InstRepos:42669767. would also be noteworthy to observe and take lessons from the some of the new places like Alabama, Arkansas, Florida , Texas etc which [3] CDC, “Forecasting COVID-19 in the US,” have been recently identified as potential hotspots of this pandemic. 2020. https://www.cdc.gov/coronavirus/2019- Lastly, our conclusion best resonates with the message that ncov/covid-data/forecasting-us.html. coronavirus is not some kind of race but a public health disaster and [4] “Microsoft coronavirus webpage.” we should adopt a unified approach to the fight against it. https://www.bing.com/covid. CONCLUSION [5] “COVID-19 in India.” [Internet]. Available Here, we summarize the take-home messages from this paper: from: https://www.covid19india.org/. • A city can reopen only if it is past the peak of cases. [6] L. Star and S. Moghadas, “The Role of Reopening must be accompanied by robust contact tracing. 
The Mathematical Modelling in Public Health Planning and US CDC has laid down a set of reopening guidelines which are Decision Making,” Natl. Collab. Cent. Infect. Dis., vol. (5)2, no. 2, pp. 285–299, 2010. compatible with our model and its solutions. [7] Livemint, ““Many states are far short of • Incorporation of socio-behavioral theories can come COVID-19 testing levels.” into play for effective execution of interventional strategies. https://www.statnews.com/2020/04/27/coronavirus- many-states-short-of-testing-levels-needed-for- safereopening/. KDD KiML 2020 Sharma et al. [8] Harvard Business Review, “A Plan to no-longer-exists-provokes-controversy.html. Safely Reopen the U.S. Despite Inadequate Testing.” https://hbr.org/2020/05/a-plan-to-safely-reopen-the-u- s-despite-inadequate-testing. [9] S. Telles, S. K. Reddy, and H. R. Nagendra, “Variation in False Negative Rate of RT-PCR Based SARS-CoV-2 Tests by Time Since Exposure,” J. Chem. Inf. Model., vol. 53, no. 9, pp. 1689–1699, 2019, doi: 10.1017/CBO9781107415324.004. [10] M. Lee, “Given low adoption rate of TraceTogether, experts suggest merging with SafeEntry or other apps,” Today, 2020. https://www.todayonline.com/singapore/given-low- adoption-rate-tracetogether-experts-suggest-merging- safeentry-or-other-apps. [11] A. Zargar, “Privacy, security concerns as India forces virus-tracking app on millions,” CBS News. . [12] K. Bajpai, “Five lessons of COVID.” Available: from: https://timesofindia.indiatimes.com/blogs/toi- editpage/five-lessons-of-covid-factors-that-are- negative-for-india-are-having-greater-impact-than- mitigating-ones/.. [13] K. Grimes, “Is politics the reason why Gov. Newsom is keeping California locked down ?,” California Globe. . [14] R.Guha, “What Modi got wrong on COVID-19 and how he can fix it.” https://www.ndtv.com/opinion/5-lessons-for-modi-on- covid-19-by-ramachandra-guha-2227259. [15] K. Weintraub, “Sweden sticks with controverial covid approach.,” [Online]. 
Available: https://www.webmd.com/lung/news/20200501/sweden-sticks-with-controversial-covid19-approach.

[16] The Island Now, "Cuomo has failed in his handling of coronavirus." https://theislandnow.com/opinions-100/readers-write-cuomo-has-failed-in-handling-of-coronavirus/.

[17] "Are care homes the dark side of Sweden's coronavirus strategy?" https://www.euronews.com/2020/05/19/are-care-homes-the-dark-side-of-sweden-s-coronavirus-strategy.

[18] "What's going wrong in Sweden's care homes?"

[19] L. van Dorp et al., "Emergence of genomic diversity and recurrent mutations in SARS-CoV-2," Infect. Genet. Evol., vol. 83, p. 104351, 2020, doi: 10.1016/j.meegid.2020.104351.

[20] H. Ellyatt, "Coronavirus no longer exists clinically - controversy," CNBC. https://www.cnbc.com/2020/06/02/claim-coronavirus-no-longer-exists-provokes-controversy.html.

Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets

Jitin Krishnan (Department of Computer Science, George Mason University, Fairfax, VA; jkrishn2@gmu.edu), Hemant Purohit (Department of Information Sciences & Technology, George Mason University, Fairfax, VA; hpurohit@gmu.edu), Huzefa Rangwala (Department of Computer Science, George Mason University, Fairfax, VA; rangwala@gmu.edu)

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). KiML'20, August 24, 2020, San Diego, California, USA. © 2020 Copyright held by the author(s). https://doi.org/10.1145/nnnnnnn.nnnnnnn

ABSTRACT

State-of-the-art models for cross-lingual language understanding such as XLM-R [7] have shown great performance on benchmark data sets. However, they typically require some fine-tuning or customization to adapt to downstream NLP tasks for a domain. In this work, we study the unsupervised cross-lingual text classification task in the context of the crisis domain, where rapidly filtering relevant data regardless of language is critical to improve the situational awareness of emergency services. Specifically, we address two research questions: a) Can a custom neural network model over XLM-R, trained only in English for such a classification task, transfer knowledge to multilingual data and vice-versa? b) By employing an attention mechanism, does the model attend to words relevant to the task regardless of the language? To this goal, we present an attention realignment mechanism that utilizes a parallel language classifier to minimize any linguistic differences between the source and target languages. Additionally, we pseudo-label the tweets from the target language, which are then augmented with the tweets in the source language for retraining the model. We conduct experiments using Twitter posts (tweets) labelled as a 'request' in the open source data set by Appen¹, consisting of multilingual tweets for crisis response. Experimental results show that attention realignment and pseudo-labelling improve the performance of unsupervised cross-lingual classification. We also present an interpretability analysis by evaluating the performance of attention layers on original versus translated messages.

KEYWORDS: Social Media, Crisis Management, Text Classification, Unsupervised Cross-Lingual Adaptation, Interpretability

ACM Reference Format: Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Attention Realignment and Pseudo-Labelling for Interpretable Cross-Lingual Classification of Crisis Tweets. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20), 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Social media platforms such as Twitter provide valuable information to aid emergency response organizations in gaining real-time situational awareness during the sudden onset of crisis situations [4]. Extracting critical information about affected individuals, infrastructure damage, medical emergencies, or food and shelter needs can help emergency managers make time-critical decisions and allocate resources efficiently [15, 21, 22, 30, 31, 36]. Researchers have designed numerous classification models to help towards this humanitarian goal of converting real-time social media streams into actionable knowledge [1, 22, 26, 28, 29]. Recently, with the advent of multilingual models such as multilingual BERT [9] and XLM [20], researchers have started adopting them for multilingual disaster tweets [6, 25]. Since XLM-R [7] has been shown to be the most superior model in cross-lingual language understanding, we restrict our work to this model to explore the aspects of cross-lingual transfer of knowledge and interpretability.

Figure 1: Problem: Unsupervised cross-lingual tweet classification, e.g., train a model using English tweets, predict labels for multilingual tweets, and vice-versa.

In this work, we address two questions. The first is to examine whether XLM-R is effective in capturing multilingual knowledge, by constructing a custom model over it to analyze if a model trained using English-only tweets will generalize to multilingual data and vice-versa. Social media streams are generally different from other text, given the user-generated content. For example, tweets are usually short, with possible errors and ambiguity in the behavioral expressions. These properties in turn make the classification task, or extracting representations, a bit more challenging. The second question is to examine whether word translations will be equally attended by the attention layers. For instance, the words with higher attention weights in a sentence in Haitian Creole such as "Tanpri nou bezwen tant avek dlo nou zon silo mesi" should align with the words in its corresponding translated tweet in English, "Please, we need tents and water. We are in Silo, Thank you!". Our core idea is that if 'dlo' in the Haitian tweet has a higher weight, so should its English translation 'water'. This word-level language-agnostic property can promote machine learning models to be more interpretable. It also brings several benefits to downstream tasks such as knowledge graph construction using keywords extracted from tweets. In situations where data is available only in one language, this similarity in attention would still allow us to extract relevant phrases in cross-lingual settings. To the best of our knowledge, in the crisis analytics domain, aligning attention in a cross-lingual setting has not been attempted before. In this work, we restrict our classification experiments to tweets containing 'request' intent, which will be expanded to other behaviors, tasks, and datasets in the future.

Contributions: We propose a novel attention realignment method which promotes the task classifier to be more language agnostic, which in turn tests the effectiveness of the multilingual knowledge capture of the XLM-R model for crisis tweets; and a pseudo-labelling procedure to further enhance the model's generalizability. Further, incorporating the attention-based mechanism allows us to perform an interpretability analysis on the model, by comparing how words are attended in the original versus translated tweets.

¹ https://appen.com/datasets/combined-disaster-response-data/

2 RELATED WORK AND BACKGROUND

There are numerous prior works (c.f. surveys [4, 14]) that focus specifically on disaster-related data to perform classification and other rapid assessments during the onset of a new disaster event. A crisis period is an important but challenging situation, where collecting labeled data during an ongoing event is very expensive.

With more and more machine learning systems being adopted by diverse application domains, transparency in decision-making inevitably becomes an essential criterion, especially in high-risk scenarios [12] where trust is of utmost importance. With deep neural networks, including natural language systems, shown to be easily fooled [16], there have been many promising ideas that empower machine learning systems with the ability to explain their predictions [5, 32]. Gilpin et al. [11] present a survey of interpretability in machine learning, which provides a taxonomy of research that addresses various aspects of this problem. Similar to the work by Ross et al. [33], we employ an attention-based approach to evaluate model interpretability applied to the crisis domain.

3 METHODOLOGY

3.1 Problem Statement: Unsupervised Cross-Lingual Crisis Tweet Classification

Consider tweets in language A and their corresponding translated tweets in language B. The task of unsupervised cross-lingual classification is to train a classifier using data only from the source language and predict the labels for the data in the target language. This experimental setup is usually represented as A → B for training a model using A and testing on B, or B → A for training a model using B and testing on A. X refers to the data and Y refers to the ground-truth labels. The multilingual dataset used in our experiments consists of original multilingual (ml) tweets and their translated (en) tweets in English. To summarize:

Experiment A (en → ml):
Input: X_en, Y_en, X_ml
Output: Y_ml^pred ← predict(X_ml)

Experiment B (ml → en):
Input: X_ml, Y_ml, X_en
Output: Y_en^pred ← predict(X_en)
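The protocol above can be sketched end to end. The following is a minimal illustration, not the authors' code: a nearest-centroid classifier stands in for the XLM-R-based model, and random vectors stand in for tweet embeddings; all function names are ours.

```python
import numpy as np

def train_centroids(X_src, y_src):
    """Fit one centroid per class on source-language embeddings only."""
    classes = np.unique(y_src)
    return {c: X_src[y_src == c].mean(axis=0) for c in classes}

def predict(centroids, X_tgt):
    """Label target-language embeddings by the nearest class centroid."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X_tgt - centroids[c], axis=1)
                      for c in classes], axis=1)
    return np.array([classes[i] for i in dists.argmin(axis=1)])

# Experiment A (en -> ml): train on English data, predict on multilingual
# data; no target labels are ever used, keeping the setup unsupervised.
rng = np.random.default_rng(0)
X_en = np.vstack([rng.normal(0, 1, (20, 8)), rng.normal(3, 1, (20, 8))])
y_en = np.array([0] * 20 + [1] * 20)
X_ml = np.vstack([rng.normal(0, 1, (5, 8)), rng.normal(3, 1, (5, 8))])

centroids = train_centroids(X_en, y_en)
y_ml_pred = predict(centroids, X_ml)
```

Swapping the toy classifier for the XLM-R + BiLSTM + attention model described later leaves the protocol itself unchanged.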
This problem led to several works on domain adaptation techniques, in which machine learning models can learn and generalize to unseen crisis events [3, 10, 18, 23]. In the context of crisis data, Nguyen et al. [28] designed a convolutional neural network model which does not require any feature engineering, and Alam et al. [1] designed a CNN architecture with adversarial training on graph embeddings. Krishnan et al. [19] showed that sharing a common layer across multiple tasks can improve the performance of tasks with limited labels.

In the multilingual or cross-lingual direction, many works [8, 17] tried to align word embeddings (such as fastText [27]) from different languages into the same space, so that a word and its translations have the same vector. These models are superseded by models such as multilingual BERT [9] and XLM-R [7] that produce contextual embeddings, which can be pretrained using several languages together to achieve impressive performance gains on multilingual use-cases.

The attention mechanism [2, 24] is one of the most widely used methods in deep learning; it constructs a context vector by weighing the entire input sequence, improving over previous sequence-to-sequence models [13, 34, 35]. As the model produces weights associated with each word in a sentence, this allows for evaluating interpretability by comparing the words that are given priority in original versus translated tweets.

3.2 Overview

In the following sections, we propose two methodologies to enhance cross-lingual classification: 1) Attention Realignment and 2) Pseudo-Labelling. Attention realignment utilizes a language classifier which is trained in parallel to realign the attention layer of the task classifier, such that the weights are geared more towards task-specific words regardless of the language. Pseudo-labelling further enhances the classifier by adding high-quality seeds from the target language that are pseudo-labelled by the task classifier.

3.3 Attention Realignment by Parallel Language Classifier

As depicted in Fig. 2, the model on the left side is the task classifier and the model on the right side is a language classifier that is trained in parallel. The purpose of this language classifier is to pick up aspects that are missed by the XLM-R model. These could be tweet-specific, crisis-specific, or other linguistic nuances that can separate original tweets from translated tweets. Note that, semantically, translated words are expected to have similar XLM-R representations.

Figure 2: Attention Realignment with Pseudo-Labelling over XLM-R model

Table 1: Notations
en — tweets translated to English ('message' column in the dataset)
ml — multilingual tweets ('original' column in the dataset)
α — attention layer
T — a component that uses task-specific data, i.e., + and − 'request' tweets
L — a component that uses language-specific data, i.e., en and ml tweets
a_BiLSTM — activation from the BiLSTM layer
β, γ, ζ — hyperparameters

Attention realignment is a mechanism we introduce to promote the task classifier to be more language independent. The main idea is that words which are given higher attention by the language classifier should be less important to the task classifier. For example, 'dlo' in Haitian and 'water' in English should have the same vector representation in language-agnostic models, while the sentence structure, grammar, and other nuances can vary. We enforce this rule by constructing two operations:

(1) Attention Difference: When a sentence goes through model M1, it also goes through model M2. For the same sentence, this returns two attention weight vectors: one from the task classifier (α_T) and the other from the language classifier (α_T′). Directly subtracting α_T′ from α_T poses two issues: 1) we do not know whether they are comparable, and 2) α_T′ may have negative values. A simple solution is to normalize both vectors and clip α_T′ such that it lies between 0 and 1. Thus, the attention subtraction step is:

α_T ← α_T/‖α_T‖ − γ_T · clip(α_T′/‖α_T′‖, 0, 1)    (1)

where γ_T is a hyperparameter that tunes the amount of subtraction needed for the task classifier. Similarly, for the language classifier:

α_L′ ← α_L′/‖α_L′‖ − γ_L · clip(α_L/‖α_L‖, 0, 1)    (2)

(2) Attention Loss: Along with the attention difference, the model can also be trained by inserting an additional loss term that penalizes the similarity between the attention weights from the two classifiers. We use the Frobenius norm:

L_At = ‖α_Tᵀ α_T′‖²_F    (3)

L_Al = ‖α_Lᵀ α_L′‖²_F    (4)

for the task and language classifiers respectively. The resulting final loss function of the joint training is:

L(θ) = ζ_T·CE_T + β_T·L_At + ζ_L·CE_L + β_L·L_Al    (5)

where β is the hyperparameter that tunes the attention loss weight, ζ is the hyperparameter that tunes the joint training loss, and CE denotes the binary cross entropy loss (Eq. 6).

Table 3: Implementation Details
T_x: 30; Deep learning library: Keras; Optimizer: Adam [lr = 0.005, beta_1 = 0.9, beta_2 = 0.999, decay = 0.01]; Maximum epochs: 100; Dropout: 0.2; Early stopping patience: 10; Batch size: 32; ζ_T = 1; ζ_L = 0.1; β_T, β_L, γ_T, γ_L = 0.01

We use the open source dataset from Appen³, consisting of multilingual crisis response tweets. The dataset statistics for tweets with 'request' behavior labels are shown in Table 2. For all the experiments, the dataset is balanced for each split.
    CE = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]    (6)

It is important to note that the Frobenius norm is not simply between the attention weights of the two models, but rather between the attention weights produced by the two models on the same input tweet. For example, for a given tweet, the task classifier attends more to task-specific words and the language classifier attends to language-specific words; the mechanism makes sure that they are distinct.

Each experiment is denoted as A → B, where A is the data used to train the model and B is the data used to test the model. For example, en → ml means we train the model using English tweets and test on multilingual tweets.

Models are implemented in Keras, and the details are shown in Table 3. Hyperparameters β_T, β_L, γ_T, and γ_L are not exhaustively tuned; we leave this exploration for future work.

Table 4: Performance Comparison (Accuracy in %) for Source → Target (Source → Source in brackets).
Baseline = XLM-R + BiLSTM + Attention; Model M1 = Baseline + Attention Realignment; Model M2 = Model M1 + Pseudo-Labelling.

                Baseline        Model M1        Model M2
    en → ml     59.98 (80.57)   62.53 (77.02)   66.79 (82.39)
    ml → en     60.93 (70.07)   65.69 (63.50)   70.95 (73.84)

3.4 Pseudo-Labelling
To enhance the model further, we pseudo-label the data in the target language. For example, if we are training a model using the English tweets, we use the original tweets before translation for pseudo-labelling. The idea is simply to gather high-quality seeds from the target to retrain the model. Note that we still do not use any target labels here, still following the unsupervised goal. Thus, for retraining model M1 for en → ml, the new dataset would consist of X⁺_en and X^pseudo⁺_ml as positive examples, and X⁻_en and X^pseudo⁻_ml as negative examples.
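The seed-selection step of pseudo-labelling can be sketched as follows. This is a hypothetical helper, not the released code; the 0.7 confidence threshold matches the one reported for Model M1's seed selection in Section 5:

```python
import numpy as np

def pseudo_label(probs, threshold=0.7):
    """Select high-confidence target-language seeds from a model's
    predicted positive-class probabilities. Returns (indices, labels).
    No gold target labels are used, keeping the setup unsupervised."""
    probs = np.asarray(probs)
    pos = np.where(probs > threshold)[0]        # confident positives
    neg = np.where(probs < 1.0 - threshold)[0]  # confident negatives
    idx = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos), dtype=int),
                             np.zeros(len(neg), dtype=int)])
    return idx, labels

# Toy predicted probabilities on target-language (ml) tweets.
p = [0.95, 0.40, 0.10, 0.80, 0.55]
idx, labels = pseudo_label(p, threshold=0.7)
# Tweets 0 and 3 become pseudo-positive seeds and tweet 2 a
# pseudo-negative seed; the uncertain ones (0.40, 0.55) are discarded.
```

The selected seeds are then appended to the source-language training set and the task classifier is retrained.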
3.5 XLM-R Usage
The recommended feature usage of XLM-R² is either fine-tuning to the task or aggregating features from all the 25 layers. We employ the latter to extract the multilingual embeddings for the tweets.

² https://github.com/facebookresearch/XLM

4 DATASET & EXPERIMENTAL SETUP

³ https://appen.com/datasets/combined-disaster-response-data/

Table 2: Dataset Statistics for both en and ml
                Train    Validation   Test
    Positive     3554           418    496
    Negative    17473          2152   2128

5 RESULTS & DISCUSSION
Table 4 shows the cross-lingual performance comparison of all the models. The three models are described below:

(1) Baseline: The baseline model consists of embeddings retrieved from XLM-R, trained over BiLSTM and attention layers. This is a traditional sequence (text) classifier enhanced with an attention mechanism. Activations from the BiLSTM layers are weighed by the attention layer to construct the context vector, which is then passed through a dense layer and softmax function to produce the classification output.

(2) Model M1: Adding attention realignment to the baseline model produces model M1. Attention realignment is achieved through a language classifier which is trained in parallel, with the goal of making the task classifier more language agnostic. The attention weights for both task and language classifiers are manipulated by each other during training by a process of subtraction (attention difference) as well as a loss component (attention loss). See Section 3.3.

Figure 3: Attention visualization example for 'request' tweets: words and their attention weights for two tweets in Haitian Creole and their translations in English (the darker the shade, the higher the attention).

The same-language (Source → Source) scores are shown in brackets in Table 4. A deeper investigation in this direction on various other tasks can shed more light on the impact of the realignment mechanism.
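Reading Table 4, the improvements reported in Section 5 appear to be relative gains over the baseline accuracy rather than absolute percentage-point differences; a few lines reproduce them (to within rounding) from the table's cross-lingual columns:

```python
# Accuracy values copied from Table 4 (cross-lingual setting).
table4 = {
    "en->ml": {"Baseline": 59.98, "M1": 62.53, "M2": 66.79},
    "ml->en": {"Baseline": 60.93, "M1": 65.69, "M2": 70.95},
}

def relative_gain(model_acc, baseline_acc):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (model_acc - baseline_acc) / baseline_acc

gains = {
    setup: {m: round(relative_gain(accs[m], accs["Baseline"]), 1)
            for m in ("M1", "M2")}
    for setup, accs in table4.items()
}
# e.g. en->ml: M1 is +4.3% and M2 is +11.4% relative to the baseline.
```

This matches the reported +4.3%, +11.4%, and +7.8% figures exactly, and the +16.5% figure to within 0.1.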
(3) Model M2: Adding the pseudo-labelling procedure to model M1 produces model M2. Using model M1, which is trained to be language agnostic, tweets from the target languages are pseudo-labelled. High-quality seeds are selected (using Model M1, p > 0.7) and augmented to the original training dataset to retrain the task classifier.

Results show that, for cross-lingual evaluation on en → ml, model M1 outperforms the baseline by +4.3% and model M2 outperforms it by +11.4%. On ml → en, model M1 outperforms the baseline by +7.8% and model M2 outperforms it by +16.5%. This shows that both models are effective in cross-lingual crisis tweet classification. An interesting observation is that using attention realignment alone decreased the classification performance in the same language, which is brought back up by pseudo-labelling.

5.1 Interpretability: Attention Visualization
We follow an attention architecture similar to the one shown in [18]. The context vector is constructed as the dot product between the attention weights and the word activations. This represents the interpretable layer in our architecture, as the attention weights represent the importance of each word in the classification process. Two examples are shown in Figure 3. In the first example, both en → en and ml → ml give attention to the word 'hungry' (i.e., 'grangou' in Haitian Creole). Note that these two are results from models trained in the same language in which they are tested, so ideal performance is expected. For the baseline model in the cross-lingual set-up en → ml, although it correctly predicts the label, the attention weights are more spread apart. In model M2, with attention realignment and pseudo-labelling, the attention weights are shifted more toward 'grangou', although with some spread.
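The interpretable layer described above — softmax-normalized attention weights over words, combined with the word activations to form the context vector — can be sketched in NumPy. The random activations below are stand-ins; the actual model uses the Keras stack of Table 3 with BiLSTM activations:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h, w):
    """Given per-word activations h (T_x x d) and an attention scoring
    vector w (d,), return the word weights alpha and the context vector
    (the attention-weighted combination of the activations)."""
    scores = h @ w           # one scalar score per word
    alpha = softmax(scores)  # attention weights, sum to 1
    context = alpha @ h      # weighted sum of word activations
    return alpha, context

rng = np.random.default_rng(0)
T_x, d = 30, 8                    # T_x = 30 words, as in Table 3
h = rng.normal(size=(T_x, d))     # stand-in for BiLSTM activations
w = rng.normal(size=d)            # stand-in attention parameters
alpha, context = attention_context(h, w)
```

Because alpha sums to one, its entries can be read directly as per-word importances, which is what the shading in Figure 3 visualizes.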
Similarly, in example 2, the attention weights in the baseline model are more spread apart, while the cross-lingual performance of model M2 aligns more with en → en and ml → ml. These examples show the importance of having interpretability as a key criterion in cross-lingual crisis tweet classification problems; it can also be used for downstream tasks such as extracting relevant keywords for knowledge graph construction.

6 CONCLUSION
We presented a novel approach to the unsupervised cross-lingual crisis tweet classification problem, using a combination of an attention realignment mechanism and a pseudo-labelling procedure (over the state-of-the-art multilingual model XLM-R) to promote the task classifier to be more language agnostic. Performance evaluation showed that models M1 and M2 outperformed the baseline by +4.3% and +11.4% respectively for cross-lingual text classification from English to multilingual. We also presented an interpretability analysis by comparing the attention layers of the models; it shows the importance of incorporating a word-level language-agnostic characteristic into the learning process when training data is available only in one language. Performing extensive hyperparameter tuning and expanding the idea to other tasks (including cross-task/multi-task) are left as future work. Another direction for future work is to incorporate human-engineered knowledge from multilingual knowledge graphs such as BabelNet into our model architecture, which could improve the learning of similar concepts across languages that are critical to crisis response agencies.

Reproducibility: Source code is available at: https://github.com/jitinkrishnan/Cross-Lingual-Crisis-Tweet-Classification

7 ACKNOWLEDGEMENT
The authors would like to thank U.S. National Science Foundation grants IIS-1815459 and IIS-1657379 for partially supporting this research.

REFERENCES
[1] Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. arXiv preprint arXiv:1805.05151 (2018).
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 120–128.
[4] Carlos Castillo. 2016. Big crisis data: social media in disasters and time-critical situations. Cambridge University Press.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. 2172–2180.
[6] Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2020. Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 292–298.
[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019).
[8] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word Translation Without Parallel Data. arXiv preprint arXiv:1710.04087 (2017).
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014).
[11] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 80–89.
[12] David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017).
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR) 47, 4 (2015), 1–38.
[15] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894 (2016).
[16] Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328 (2017).
[17] Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745 (2018).
[18] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Diversity-Based Generalization for Neural Unsupervised Text Classification under Domain Shift. https://arxiv.org/pdf/2002.10937.pdf (2020).
[19] Jitin Krishnan, Hemant Purohit, and Huzefa Rangwala. 2020. Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Social Web Data for Emergency Services. arXiv preprint arXiv:2003.04991 (2020).
[20] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019).
[21] Kathy Lee, Ankit Agrawal, and Alok Choudhary. 2013. Real-time disease surveillance using twitter data: demonstration on flu and cancer. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1474–1477.
[22] Hongmin Li, Doina Caragea, Cornelia Caragea, and Nic Herndon. 2018. Disaster response aided by tweet classification with a domain adaptation approach. Journal of Contingencies and Crisis Management 26, 1 (2018), 16–27.
[23] Zheng Li, Ying Wei, Yu Zhang, and Qiang Yang. 2018. Hierarchical attention transfer network for cross-domain sentiment classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[25] Guoqin Ma. 2019. Tweets Classification with BERT in the Field of Disaster Management. https://pdfs.semanticscholar.org/d226/185fa1e14118d746cf0b04dc5be8f545ec24.pdf.
[26] Reza Mazloom, Hongmin Li, Doina Caragea, Cornelia Caragea, and Muhammad Imran. 2019. A Hybrid Domain Adaptation Approach for Identifying Crisis-Relevant Tweets. International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 11, 2 (2019), 1–19.
[27] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
[28] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902 (2016).
[29] Ferda Ofli, Patrick Meier, Muhammad Imran, Carlos Castillo, Devis Tuia, Nicolas Rey, Julien Briant, Pauline Millet, Friedrich Reinhard, Matthew Parkan, et al. 2016. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big Data 4, 1 (2016), 47–59.
[30] Bahman Pedrood and Hemant Purohit. 2018. Mining help intent on twitter during disasters via transfer learning with sparse coding. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, 141–153.
[31] Hemant Purohit, Carlos Castillo, Fernando Diaz, Amit Sheth, and Patrick Meier. 2013. Emergency-relief coordination on social media: Automatically matching resource requests and offers. First Monday 19, 1 (Dec. 2013).
[32] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[33] Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. 2017. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717 (2017).
[34] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[35] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[36] István Varga, Motoki Sano, Kentaro Torisawa, Chikara Hashimoto, Kiyonori Ohtake, Takao Kawai, Jong-Hoon Oh, and Stijn De Saeger. 2013. Aid is out there: Looking for help from tweets during a large scale disaster. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1619–1629.