=Paper=
{{Paper
|id=Vol-2657/paper1
|storemode=property
|title=Textual Evidence for the Perfunctoriness of Independent Medical Reviews
|pdfUrl=https://ceur-ws.org/Vol-2657/paper1.pdf
|volume=Vol-2657
|authors=Adrian Brasoveanu,Megan Moodie,Rakshit Agrawal
|dblpUrl=https://dblp.org/rec/conf/kdd/0001MA20
}}
==Textual Evidence for the Perfunctoriness of Independent Medical Reviews==
Adrian Brasoveanu (abrsvn@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Megan Moodie (mmoodie@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Rakshit Agrawal (ragrawal@camio.com), Camio Inc., San Mateo, CA
ABSTRACT

We examine a database of 26,361 Independent Medical Reviews (IMRs) for privately insured patients, handled by the California Department of Managed Health Care (DMHC) through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance (either private insurance or the insurance that is part of their workers' compensation; we focus on private insurance here). Laws requiring IMR were established in California and other states because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services. We analyze the text of the reviews and compare them closely with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10]. Despite the fact that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews, we can construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation, as well as low perplexity (11.86) and high categorical accuracy (0.53) on unseen test data, compared to the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy: 0.29 and 0.39). We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. We also examine four other corpora (drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]) to show that the IMR results are not typical for specialized-register corpora. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which points to the possibility that a crucial consumer protection mandated by law fails a sizeable class of highly vulnerable patients.

CCS CONCEPTS

• Computing methodologies → Latent Dirichlet allocation; Neural networks.

KEYWORDS

AI for social good, state-managed medical review processes, language models, topic models, sentiment classification

ACM Reference Format:
Adrian Brasoveanu, Megan Moodie, and Rakshit Agrawal. 2020. Textual Evidence for the Perfunctoriness of Independent Medical Reviews. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, August 24, 2020, San Diego, California, USA
© 2020 Copyright held by the author(s).
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

1.1 Origin and structure of IMRs

Independent Medical Review (IMR) processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance – either private insurance or the insurance that is part of their workers' compensation. In this paper, we focus exclusively on privately insured patients. Laws requiring IMR processes were established in California and other states in the late 1990s because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services to maximize profit. [Footnote 1: For California, see the Friedman-Knowles Act of 1996, requiring California health plans to provide external independent medical review (IMR) for coverage denials. As of late 2002, 41 states and the District of Columbia had passed legislation creating an IMR process. In 34 of these states, including California, the decision resulting from the IMR is binding on the health plan. See [1, 15] for summaries of the political and legal history of the IMR system, and [2] for an early partial survey of the DMHC IMR data.]

As aptly summarized in [1], IMR is regularly used to settle disputes between patients and their health insurers over what is medically necessary or experimental/investigational care. Medical necessity disputes occur between health plans and patients because the health plan disagrees with the patient's doctor about the appropriate standard of care or course of treatment for a specific condition. Under the current system of managed care in the U.S., services rendered by a health care provider are reviewed to determine whether the services are medically necessary, a process referred to as utilization review (UR). UR is the oversight mechanism through which private insurers control costs by ensuring that only medically necessary care, covered under the contractual terms of a patient's insurance plan, is provided. Services that are not deemed medically necessary or fall outside a particular plan are not covered.

Procedures or treatment protocols are deemed experimental or investigational because the health plan – but not necessarily the patient's doctor, who in many cases has enough clinical confidence in a treatment to order it – considers them non-routine medical care, or takes them to be scientifically unproven to treat the specific condition, illness, or diagnosis for which their use is proposed.

It is important to realize that the IMR process is usually the third and final stage in the medical review process. The typical progression is as follows. After in-person and possibly repeated examination of the patient, the doctor recommends a treatment,
which is then submitted for approval to the patient's health plan. If the treatment is denied in this first stage, both the doctor and the patient may file an appeal with the health plan, which triggers a second stage of reviews by the health-insurance provider, for which a patient can supply additional information and a doctor may engage in what is known as a "peer to peer" discussion with a health-insurance representative. If these second reviews uphold the initial denial, the only recourse the patient has is the state-regulated IMR process, and per California law, an IMR grievance form (and some additional information) is included with the denial letter.

An IMR review must be initiated by the patient and submitted to the California Department of Managed Health Care (DMHC), which manages IMRs for privately-insured patients. Motivated treating physicians may provide statements of support for inclusion in the documentation provided to DMHC by the patient, but in theory the IMR creates a new relationship of care between the reviewing physician(s) hired by a private contractor on behalf of DMHC, and the patient in question. The reviewing physicians' decision is supposed to be made based on what is in the best interest of the patient, not on cost concerns. It is this relation of care that constitutes the consumer protection for which IMR processes were legislated. Understandably, given that the patients in question may be ill or disabled or simply discouraged by several layers of cumbersome bureaucratic processes, there is very high attrition from the initial review to the final, IMR, stage. That is, only the few highly motivated and knowledgeable patients – or the extremely desperate – get as far as the IMR process.

The IMR process is regulated by the state, but it is actually conducted by a third party. At this time (2019), the provider in California and several other states across the US is MAXIMUS Federal Services, Inc. [Footnote 2: https://www.maximus.com/capability/appeals-imr] The costs associated with the IMR review, at least in California, are covered by health insurers. It is DMHC's and MAXIMUS's responsibility to collect all the documentation from the patient, the patient's doctor(s) and the health insurer. There are no independent checks that all the documentation has actually been collected, however, and patients do not see a final list of what has been provided to the reviewer prior to the IMR decision itself (a post facto list of file contents is mailed to patients along with the final, binding, decision; it is unclear what recourse a patient may have if they find pertinent information was missing from the review file). Once the documentation is assembled, MAXIMUS forwards it to anywhere from one to three reviewers, who remain anonymous, but are certified by MAXIMUS to be appropriately credentialed and knowledgeable about the treatment(s) and condition(s) under review. The reviewer submits a summary of the case, and also a rationale and evidence in support of their decision, which is a binary Upheld/Overturned decision about the medical service. IMR reviewers do not enter a consultative relationship with the patient, doctor or health plan – they must render an uphold/overturn decision based solely on the provided medical records. However, as noted above, they are in an implied relationship of care to the patient, a point to which we return in the Discussion section below (§4).

While insurance carriers do not provide statistics about the percentage of requested treatments that are denied in the initial stage, looking at the process as a whole, a pattern of service denial aimed at maximizing profit, rather than simply maintaining cost effectiveness, seems to emerge. Typically, the argument for denial contends that the evidence for the beneficial effects of the treatment fails the prevailing standard of scientific evidence. This prevailing standard invoked by IMR reviewers is usually randomized controlled trials (RCTs), which are expensive, time-consuming trials that are run by large pharmaceutical companies only if the treatment is ultimately estimated to be profitable.

RCTs, however, have known limits: they "require minimal assumptions and can operate with little prior knowledge [which] is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded." [3] Inflexibly applying the RCT "gold standard" in the IMR process is often a way to ignore the doctors' knowledge and experience in a way that seems superficially well-reasoned and scientific. "RCTs can play a role in building scientific knowledge and useful predictions" – and we add, treatment recommendations – "only [...] as part of a cumulative program, [in combination] with other methods." [3]

Notably, the experimental/investigational category of treatments that get denied often includes promising treatments that have not been fully tested in clinical RCTs – because the treatment is new or the condition is rare in the population, so treatment development costs might not ultimately be recovered. Another common category of experimental/investigational denials involves "off-label" drug uses, that is, uses of FDA-approved pharmaceuticals for a purpose other than the narrow one for which the drug was approved.

1.2 Main argument and predictions

Recall that these 'experimental' treatments or off-label uses are recommended by the patient's doctor, and therefore their potential benefits are taken to outweigh their possible negative effects. The recommending doctor is likely very familiar with the often lengthy, tortuous and highly specific medical history of the patient, and with the list of 'less experimental' treatments that have been proven unsuccessful or have been removed from consideration for patient-specific reasons. It is also important to remember that many rare conditions have no "on-label" treatment options available, since expensive RCTs and treatment approval processes are not undertaken if companies do not expect to recover their costs, which is likely if the potential 'market' is small (few people have the rare condition). Therefore, our main line of argumentation is as follows.

• Since IMRs are the final stage in a long bureaucratic process in which health insurance companies keep denying coverage for a treatment repeatedly recommended by a doctor as medically necessary, we expect that the issue of medical necessity is non-trivial when that specific patient and that specific treatment are carefully considered.

• We should therefore expect the text of the IMRs, which justifies the final determination, to be highly individualized and argue for that final decision (whether congruent with the health plan's decision or not) in a way that involves the particulars of the treatment and the particulars of the patient's medical history and conditions.

Thus, we expect a reasoned, thoughtful IMR to not be highly generic and templatic / predictable in nature. For instance, legal
documents may be highly templatic as they discuss the application of the same law or policy across many different cases, but a response carefully considering the specifics of a medical case reaching the IMR stage is not likely to be similar to many other cases. We only expect high similarity and 'templaticity' for IMR reviews if they are reduced to a more or less automatic application of some prespecified set of rules (rubber-stamping).

1.3 Main results, and their limits

Concomitantly with this quantitative study, we conducted preliminary qualitative research with a focus on pain management and chronic conditions. We investigated the history of the IMR process, in addition to having direct experience with it. We had detailed conversations with doctors in Northern California and on private social media groups formed around chronic conditions and pain management. This preliminary research reliably points towards the possibility that IMR reviews are perfunctory, and that this crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. In this paper, we focus on the text of the IMR decisions and attempt to quantify the evidence for the perfunctoriness of the IMR process that they provide.

The text of the IMR findings does not provide unambiguous evidence about the quality and appropriateness of the IMR process. If we had access to the full, anonymized patient files submitted to the IMR reviewers (in addition to the final IMR decision and the associated text), we might have been able to provide much stronger evidence that IMRs should have a significantly higher percentage of overturns, and that the IMR process should be improved in various ways, e.g., (i) patients should be able to check that all the relevant documentation has been collected and will be reviewed, and (ii) the anonymous reviewers should be held to higher standards of doctor-patient care. At the very least, one would want to compare the reports/letters produced by the patient's doctor(s) and the IMR texts. However, such information is not available and there are no visible signs suggesting potential availability in the near future. The information that is made available by DMHC consists of the IMR decision – whether to uphold or overturn the health plan decision – the anonymized decision letter, and information about the requested treatment category (also available in the letter). We therefore had to limit ourselves to the text of the DMHC-provided IMR findings in our empirical analysis.

A qualitative inspection of the corpus of IMR decisions made available by the California DMHC site as of June 2019 (a total of 26,361 cases spanning the years 2001-2019) indicates that the reviews – as documented in the text of the findings – focus more on the review procedure and associated legalese than on the actual medical history of the patient and the details of the case. For example, decisions for chronic pain management seem to mostly rubber-stamp the Medical Treatment Utilization Schedule (MTUS) guidelines, with very little consideration of the rarity of the underlying condition(s) (see our comments about RCTs above), or a thoughtful evaluation of the risk/benefit profile of the denied treatment relative to the specific medical history of the patient (assuming this history was adequately documented to begin with).

The goal in this paper is to investigate to what extent Natural Language Processing (NLP) / Machine Learning (ML) methods that are able to extract insights from large corpora point in the same direction, thus mitigating the cherry-picking biases that are sometimes associated with qualitative investigations. In addition to the IMR text, we perform a comparative study with additional English-language datasets in an attempt to eliminate data-specific and problem-specific biases.

• We analyze the text of the IMR reviews and compare them with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10].

• As the size of the data has significant consequences for language-model training, and NLP/ML models more generally, we expect models trained on the Yelp and IMDB corpora to outperform models trained on the IMR corpus, given that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews.

• In this paper, we instead demonstrate that we were able to construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation.

• In addition, the model achieves a much lower perplexity (11.86) and a higher categorical accuracy (0.53) on unseen test data, compared to models trained on the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; categorical accuracy: 0.29 and 0.39).

• We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews. In an attempt to mitigate confirmation bias, as well as potentially significant register differences between IMRs and movie or restaurant reviews, we examine four additional corpora: drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]. These specialized-register corpora are potentially more similar to IMRs than IMDB or Yelp: the texts are more likely to be highly similar, include boilerplate text and have a templatic/standardized structure. We find that the predictability of the IMR texts, as measured by language-model perplexity and categorical accuracy, is higher than that of all the comparison datasets by a good margin.

Based on these empirical comparisons, we conclude that we have strong evidence that the IMR reviews are perfunctory and, therefore, that a crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. The paper is structured as follows. In Section 2, we discuss the datasets in detail, with a focus on the nature and characteristics of the IMR data. In Section 3, we discuss the models we use to analyze the IMR, Yelp and IMDB datasets, as well as the four auxiliary corpora (drug reviews, data science jobs, legal cases and recipes). The section also compares and discusses the results of these models. Section 4 puts all the results together into an argument for the perfunctoriness of the IMRs. Section 5 concludes the paper and outlines directions for future work.
2 THE DATASETS

2.1 The IMR dataset

The IMR dataset was obtained from the DMHC website in June 2019 [Footnote 3: https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinations-trend] and was minimally preprocessed. It contains 26,361 cases / observations and 14 variables, 4 of which are the most relevant:

• TreatmentCategory: the main treatment category;
• ReportYear: the year the case was reported;
• Determination: indicates if the determination was upheld or overturned;
• Findings: a summary of the case findings.

The top 14 treatment categories (with percentages of the total ≥ 2%), together with their raw counts and percentages, are provided in Table 1.

Table 1: Top 14 treatment categories

TreatmentCategory      Case count   % of total
Pharmacy                     6480          25%
Diag Imag & Screen           4187          16%
Mental Health                2599          10%
DME                          1714           7%
Gen Surg Proc                1227           5%
Orthopedic Proc              1173           5%
Rehab/ Svc - Outpt           1157           4%
Cancer Care                  1029           4%
Elect/Therm/Radfreq           828           3%
Reconstr/Plast Proc           825           3%
Autism Related Tx             767           3%
Emergency/Urg Care            582           2%
Diag/ MD Eval                 573           2%
Pain Management               527           2%

The breakdown of cases by patient gender (not recorded for all cases) is as follows: Female – 14,823 (56%), Male – 10,836 (41%), Other – 11 (0.04%).

The breakdown by determination (the outcome of the IMR) is: Upheld – 14,309 (54%), Overturned – 12,052 (46%).

The outcome counts and percentages by year are provided in Table 2. The number of cases for 2019 includes only the first 5 months of the year plus a subset of June 2019.

Table 2: Outcome counts and percentages by year

ReportYear   Total # of cases   Overturned    Upheld
2001                       28      7 (25%)        21
2002                      695    243 (35%)       452
2003                      738    280 (38%)       458
2004                      788    305 (39%)       483
2005                      959    313 (33%)       646
2006                     1080    442 (41%)       638
2007                     1342    571 (43%)       771
2008                     1521    678 (45%)       843
2009                     1432    641 (45%)       791
2010                     1453    661 (45%)       792
2011                     1435    684 (48%)       751
2012                     1203    589 (49%)       614
2013                     1197    487 (41%)       710
2014                     1433    549 (38%)       884
2015                     2079   1070 (51%)      1009
2016                     3055   1714 (56%)      1341
2017                     2953   1391 (47%)      1562
2018                     2545   1218 (48%)      1327
2019                      425    209 (49%)       216

[Figure 1: % Overturned claimed on DMHC site (June 2019)]

Interestingly, the DMHC website featured a graphic in June 2019 (Figure 1) that reports the percentage of Overturned outcomes to be 64%, a figure that does not accord with any of our data summaries. We intend to follow up on this issue and see if the DMHC can share their data-analysis pipeline so that we can pinpoint the source(s) of this difference.
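The per-year figures in Table 2 can be recomputed directly from the public DMHC export. A minimal Python/pandas sketch, assuming the CSV has been downloaded from the link in footnote 3 (the file name is arbitrary; ReportYear and Determination are the variables listed above, and the value labels "Overturned" / "Upheld" are assumed):

    import pandas as pd

    # Public DMHC IMR export (see footnote 3); the file name here is arbitrary.
    imr = pd.read_csv("dmhc_imr_determinations.csv")

    # Outcome counts by year, as in Table 2.
    by_year = pd.crosstab(imr["ReportYear"], imr["Determination"])
    by_year["Total"] = by_year.sum(axis=1)
    by_year["% Overturned"] = (100 * by_year["Overturned"] / by_year["Total"]).round(1)
    print(by_year)

    # Overall breakdown by determination (cf. the 54% Upheld / 46% Overturned split above).
    print(imr["Determination"].value_counts(normalize=True).round(2))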
Given that our main goal here is to investigate the text of the IMR findings and its predictiveness with respect to IMR outcomes, we provide some general properties of this corpus. The histogram of word counts for the IMR findings (the text associated with each case) is provided in Figure 2. There are 26,361 texts, with a total of 5,584,280 words. Words are identified by splitting texts on white space (sufficient for our purposes here). The mean length of a text is 211.84 words, with a standard deviation (SD) of 120.58.

2.2 The comparison datasets

As comparison datasets, we use the IMDB movie-review dataset [10], which has 50,000 reviews and a binary positive/negative sentiment classification associated with each review. This dataset will be particularly useful as a baseline for our ULMFiT transfer-learning language models (and subsequent transfer-learning classification models), where we show that we obtain results for the IMDB dataset that are similar to the ones in the original ULMFiT paper [8].

There are 50,000 movie reviews in the IMDB dataset, evenly split into negative and positive reviews. The histogram of text lengths for IMDB reviews is provided in Figure 2. The reviews contain a total of 11,557,297 words. The mean length of a review is 231.15 words, with an SD of 171.32.

We select a sample of 50,000 Yelp (mainly restaurant) reviews [19], with associated binarized negative/positive evaluations, to provide a comparison corpus intermediate between our DMHC dataset and the IMDB dataset. From a total of 560,000 reviews (evenly split between negative and positive), we draw a weighted random sample with the weights provided by the histogram of text lengths for the IMR corpus. The resulting sample contains 25,809 (52%) negative reviews and 24,191 (48%) positive reviews. The histogram of text lengths for Yelp reviews is also provided in Figure 2. The reviews contain a total of 7,038,467 words. The mean length of a review is 140.77 words, with an SD of 71.09.
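A minimal sketch of the length-matched Yelp sampling just described, assuming the full Yelp polarity corpus is in a dataframe `yelp` with a `text` column (the column name is ours) and reusing the `imr` dataframe from above; each Yelp review is weighted by the IMR length-histogram density at its own length:

    import numpy as np

    def n_words(s):
        # Words are identified by splitting on white space, as for the IMR texts.
        return len(s.split())

    imr_lens = imr["Findings"].map(n_words)
    yelp_lens = yelp["text"].map(n_words)

    # Density histogram of IMR text lengths; each Yelp review gets the density of
    # the IMR bin its own length falls into, so the sample mimics the IMR profile.
    density, edges = np.histogram(imr_lens, bins=50, density=True)
    bins = np.clip(np.digitize(yelp_lens, edges) - 1, 0, len(density) - 1)

    yelp_sample = yelp.sample(n=50_000, weights=density[bins], random_state=0)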
[Figure 2: Histograms of text lengths (numbers of words per text) for the (a) IMR, (b) IMDB and (c) Yelp corpora]
[Figure 3: Histograms of text lengths (numbers of words per text) for the auxiliary datasets: (a) Drug Reviews, (b) DS Jobs, (c) Legal cases, (d) Recipes]
2.3 Four auxiliary datasets

We will also analyze four other specialized-register corpora: drug reviews [6], data science (DS) job postings [9], legal case reports [5] and cooking recipes [11]. The modeling results for these specialized-register corpora will enable us to better contextualize and evaluate the modeling results for the IMR, IMDB and Yelp corpora, since these four auxiliary datasets might be seen as more similar to the IMR corpus than movie or restaurant reviews. The drug-review corpus contains reviews of pharmaceutical products, which are closer in subject matter to IMRs than movie/restaurant reviews. The other three corpora are all highly specialized in register, just like the IMRs, with two of them (DS jobs and legal cases) particularly similar to the IMRs in that they involve templatic texts containing information aimed at a specific professional sub-community.

These four corpora are very different from each other and from the IMR corpus in terms of (i) the number of texts that they contain and (ii) the average text length (number of words per text). Because of this, there was no obvious way to sample from them and from the IMR, IMDB and Yelp corpora in such a way that the resulting samples were both roughly comparable with respect to the total number of texts and average text length, and also large enough to obtain reliable model estimates. We therefore analyzed these four corpora as a whole.

The drug-review corpus includes 132,300 drug reviews – more than double the number of texts in the IMDB and Yelp datasets, and more than 4 times the number of texts in the IMR dataset. From the original corpus of 215,063 reviews, we only retained the reviews associated with a rating of 10, which we label as positive reviews, and a rating of 1 through 5, which we label as negative reviews. [Footnote 4: We did this so that we have a fairly balanced dataset (68,005 positive drug reviews and 64,295 negative reviews) to estimate classification models like the ones we report for the IMR, IMDB and Yelp corpora in the next section. For completeness, the drug-review classification results on previously unseen test data are as follows: logistic regression accuracy: 77.89%; accuracy of a multilayer perceptron with a 1,000-unit hidden layer and a ReLU non-linearity: 83.18%; ULMFiT classification model accuracy: 96.12%.]

The histogram of text lengths for drug reviews is provided in Figure 3. The reviews contain a total of 11,015,248 words, with a mean length of 83.26 words per review (significantly shorter than the IMR/IMDB/Yelp texts) and an SD of 45.73.

The DS corpus includes 6,953 job postings (about a quarter of the texts in the IMR corpus), with a total of 3,731,051 words. The histogram of text lengths is provided in Figure 3. The mean length of a job posting is 536.61 words (more than twice as long as the IMR/IMDB/Yelp texts), with an SD of 254.06.

There are 3,890 legal-case reports (even fewer than DS job postings), with a total of 25,954,650 words (about 5 times larger than the IMR corpus). The histogram of text lengths for the legal-case reports is provided in Figure 3. The mean length of a report is 6,672.15 words (an order of magnitude longer than IMR/IMDB/Yelp), with a very high SD of 11,997.98.

Finally, the recipe corpus includes more than 1 million texts: there are 1,029,719 recipes, with a total of 117,563,275 words (very large compared to our other corpora). The histogram of text lengths for the recipes is provided in Figure 3. The mean length of a recipe is 114.17 words (close to the length of a drug review, and roughly half of an IMR), with an SD of 90.54.

3 THE MODELS

In this section, we analyze the text of the IMR findings and its predictiveness with respect to IMR outcomes. We systematically compare these results with the corresponding ones for the IMDB and Yelp corpora. The datasets were split into training (80%), validation (10%) and test (10%) sets. Test sets were only used for the final model evaluation.
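A minimal sketch of this split with scikit-learn, reusing the `imr` dataframe from Section 2 (the two-step split and the fixed random seed are our choices; the same recipe applies to the comparison corpora):

    from sklearn.model_selection import train_test_split

    # 80% training, 10% validation, 10% test; the test set is only touched for
    # the final evaluation of each model.
    train_df, heldout_df = train_test_split(imr, test_size=0.2, random_state=42)
    valid_df, test_df = train_test_split(heldout_df, test_size=0.5, random_state=42)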
We start with baseline classification models (logistic regressions and logistic multilayer perceptrons with one hidden layer) to establish that the reviews in all three datasets under consideration are highly predictive of the associated binary outcomes. Once the predictiveness, hence relevance, of the text is established, we turn to an in-depth analysis of the texts themselves by means of topic and language models. We see that the text of the IMR reviews is significantly different (more predictable, less diverse / contentful) when compared to movie and restaurant reviews. We then turn to a final set of classification models that leverage transfer learning from the language models to see how predictive the texts can really be with respect to the associated binary outcomes. Finally, we report the results of estimating language models for the 4 auxiliary datasets introduced in the previous section.

The main conclusion of this extensive series of models is that the IMR corpus is an outlier, and it would be easy to make the IMR process fully automatic: it is pretty straightforward to train models that generate high-quality, realistic IMR reviews and generate binary decisions that are very reliably associated with these reviews. In contrast, movie and restaurant reviews produced by unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more human-like depth, sophistication and attention to detail, so current NLP models do not perform as well on them.

3.1 Classification models

We regress outcomes (Upheld/Overturned for IMR or negative/positive sentiment for IMDB/Yelp) against the text of the corresponding findings / reviews. For the purposes of these basic classification models, as well as the topic models discussed in the following subsection, the texts were preprocessed as follows. First, we removed stop words; for the IMR dataset, we also removed the following high-frequency words: patient, treatment, reviewer, request, medical and medically, and for the IMDB dataset, we also removed the words film and movie. After part-of-speech tagging, we retained only nouns, adjectives, verbs and adverbs, since lexical meanings provide the most useful information for logistic (more generally, feed-forward) models and topic models. The resulting dictionary for the IMR dataset had 23,188 unique words. We ensured that the dictionaries for the IMDB and Yelp datasets were also between 23,000 and 24,000 words by eliminating infrequent words. Bounding the dictionaries for each dataset to a similar range helps mitigate dataset-specific modeling biases: having differently-sized vocabularies leads to differently-sized parameter spaces for the models. We extracted features by converting each text into sparse bag-of-words vectors of dictionary length, which record how many times each token occurs in the text. These feature representations were the input to all the classifier models we consider in this subsection.

The multilayer perceptron model had a single hidden layer with 1,000 units and a ReLU non-linearity. The classification accuracies on the test data for all three datasets are provided in Table 3.
Table 3: Classification accuracy for basic models

                         IMR      IMDB     Yelp
logistic regression      90.75%   86.30%   87.62%
multilayer perceptron    90.94%   87.14%   88.92%
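A minimal sketch of this baseline pipeline, assuming spaCy for the stop-word removal and part-of-speech filtering described above and scikit-learn for the bag-of-words features and classifiers (the train/test dataframes come from the split above; hyperparameters other than the 1,000-unit ReLU hidden layer are defaults or indicative only):

    import spacy
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}
    EXTRA_STOP = {"patient", "treatment", "reviewer", "request", "medical", "medically"}

    def preprocess(text):
        # Keep only nouns, adjectives, verbs and adverbs; drop stop words and
        # the corpus-specific high-frequency words listed above.
        doc = nlp(text)
        return " ".join(t.text.lower() for t in doc
                        if t.pos_ in KEEP_POS and not t.is_stop
                        and t.text.lower() not in EXTRA_STOP)

    # Sparse bag-of-words features with a capped, comparable vocabulary size.
    vectorizer = CountVectorizer(max_features=23_000)
    X_train = vectorizer.fit_transform(train_df["Findings"].map(preprocess))
    X_test = vectorizer.transform(test_df["Findings"].map(preprocess))
    y_train, y_test = train_df["Determination"], test_df["Determination"]

    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    mlp = MLPClassifier(hidden_layer_sizes=(1000,), activation="relu").fit(X_train, y_train)

    print("logistic regression:", accuracy_score(y_test, logreg.predict(X_test)))
    print("multilayer perceptron:", accuracy_score(y_test, mlp.predict(X_test)))

For the IMDB and Yelp datasets, the same pipeline applies with the binary sentiment label in place of the Determination outcome.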
We see that the text of the findings / reviews is highly predictive of the associated binary outcomes, with the highest accuracy for the IMR dataset despite the fact that it contains half the observations of the other two datasets. We can therefore turn to a more in-depth analysis of the texts to understand what kind of textual justification is used to motivate the IMR binary decisions. To that end, we examine and compare the results of two unsupervised/self-supervised types of models: topic models and language models.

3.2 Topic models

Topic modeling [17] is an unsupervised method that distills semantic properties of words and documents in a corpus in terms of probabilistic topics. The most widespread measure for topic-model evaluation is the coherence score [14]. Typically, as we increase the number of topics from very few, say, 4 topics, to more of them, we see an increase in coherence score that tends to level out after a certain number of topics. When modeling the IMDB and Yelp datasets, we see exactly this behavior, as shown in Figure 4.

[Figure 4: Coherence scores for topic models for the (a) IMR, (b) IMDB and (c) Yelp corpora (x-axis: number of topics; y-axis: coherence score)]

In contrast, the 4-topic model has the highest coherence score (0.56) for the IMR dataset, also shown in Figure 4. Furthermore, as we add more topics, the coherence score drops. As the word clouds for the 4-topic model in Figure 5 show, these 4 topics mostly reflect the legalese associated with the IMR review procedure and very little, if anything, of the treatments and conditions that were the main point of the review. In contrast, the corresponding high-scoring topic models for the IMDB and Yelp datasets reflect actual features of movies, e.g., family-life movies, westerns, musicals etc., or breakfast/lunch places, restaurants, shops, bars, hotels etc.

[Figure 5: Word clouds for the 4-topic IMR model]

Recall that IMRs are the legally-mandated last resort for patients seeking treatments (usually) ordered by their doctors, and which their health plan refuses to cover. The reviews are conducted exclusively based on documentation. Putting aside the fact that it is unclear how much effort is taken to ensure that the documentation is complete, especially for patients with extensive and complicated health records, we see that relatively little specific information about a patient's medical history, condition(s), or the recommended treatments is reflected in the text of these decisions. The text seems to consist largely of legalese about the IMR process, the health plan / providers, basic demographic information about the patient, and generalities about the medical service or therapy requested for the enrollee's condition.
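A minimal sketch of the coherence sweep behind Figure 4, using gensim's LDA implementation and the coherence measure of [14]; `tokenized_docs` is assumed to hold the token lists produced by the preprocessing in §3.1, and the LDA hyperparameters are indicative only:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    # tokenized_docs: one list of tokens per text, after stop-word / POS filtering.
    dictionary = Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    coherence_by_k = {}
    for k in range(4, 21, 2):
        lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                       num_topics=k, passes=10, random_state=0)
        cm = CoherenceModel(model=lda, texts=tokenized_docs,
                            dictionary=dictionary, coherence="c_v")
        coherence_by_k[k] = cm.get_coherence()

    # For the IMR corpus this curve peaks at k = 4 (coherence ~ 0.56, cf. Figure 4),
    # whereas for IMDB / Yelp the coherence keeps rising with more topics.
    print(coherence_by_k)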
3.3 Language models with transfer learning

Language models, specifically those using neural networks, are usually recurrent-network or transformer-based architectures designed to learn textual distributional patterns in an unsupervised or self-supervised manner. Recurrent-network models – on which we focus here – commonly use Long Short-Term Memory (LSTM) [7] "cells," which are able to learn long-term dependencies in sequences. Representing text as a sequence of words, language models build rich representations of the words, sentences, and their relations within a certain language. We estimate a language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8]. Just as [8], we use the AWD-LSTM model [12], a vanilla LSTM with 4 kinds of dropout regularization, an embedding size of 400, 3 LSTM layers (1,150 units per layer), and a BPTT window of size 70.

The AWD-LSTM model is pretrained on Wikitext-103 [13], consisting of 28,595 preprocessed Wikipedia articles, with a total of 103 million words. This pretrained model is fairly simple (no attention, skip connections etc.), and the pretraining corpus is of modest size.

To obtain our final language models for the IMR, IMDB and Yelp corpora, we fine-tune the pretrained AWD-LSTM model using discriminative [18] and slanted triangular [8, 16] learning rates. We do the same kind of minimal text preprocessing as in [8].

The perplexity and categorical accuracy for the 3 language models are provided in Table 4. The perplexity for the IMR findings is much lower than for the IMDB / Yelp reviews, and the language model can correctly guess the next word more than half the time.

Table 4: Language-model perplexity and categorical accuracy

                         IMR     IMDB    Yelp
perplexity               11.86   36.96   40.3
categorical accuracy     53%     39%     29%
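A minimal sketch of this fine-tuning step with the fastai v1 API released alongside [8] (it ships the Wikitext-103 AWD-LSTM weights); the column names follow the IMR dataset, while learning rates, epoch counts and the dropout multiplier are indicative only:

    import math
    from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

    # Language-model data: the Findings texts, with fastai's minimal tokenization.
    data_lm = TextLMDataBunch.from_df(".", train_df, valid_df, text_cols="Findings")

    # AWD-LSTM pretrained on Wikitext-103; fit_one_cycle gives slanted triangular
    # learning rates, and slice(...) gives discriminative per-layer rates.
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    learn.fit_one_cycle(1, 1e-2)                 # tune the newly initialized head first
    learn.unfreeze()
    learn.fit_one_cycle(10, slice(1e-4, 1e-3))   # then fine-tune the whole model

    val_loss = float(learn.validate()[0])        # mean cross-entropy on validation data
    print("perplexity:", math.exp(val_loss))     # cf. 11.86 for the IMR corpus in Table 4
    learn.save_encoder("imr_enc")                # reused by the classifier in §3.4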
The IMR language model can generate high-quality and largely coherent text, unlike the IMDB / Yelp models. Two samples of generated text are provided below (each begins with a short human-provided 'seed' text).

• The issue in this case is whether the requested partial hospitalization program ( PHP ) services are medically necessary for treatment of the patient 's behavioral health condition . The American Psychiatric Association ( APA ) treatment guidelines for patients with eating disorders also consider PHP acute care to be the most appropriate setting for treatment , and suggest that patients should be treated in the least restrictive setting which is likely to be safe and effective . The PHP was initially recommended for patients who were based on their own medical needs , but who were

• The patient was admitted to a skilled nursing facility ( SNF ) on 12 / 10 / 04 . The submitted documentation states the patient was discharged from the hospital on 12 / 22 / 04 . The following day the patient 's vital signs were stable . The patient had been ambulating to the community with assistance with transfers , but has not had any recent medical or rehabilitation therapy . The patient had no new medical problems and was discharged in stable condition . The patient has requested reimbursement for the inpatient acute rehabilitation services provided

We see that the IMR language model is highly performant, despite the simple model architecture we used, the modest size of the pretraining corpus, and the small size of the IMR corpus. The quality of the generated text is also very high, particularly given all these limitations.
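Samples like the two above can be drawn directly from the fine-tuned learner; a short sketch (the seed strings mirror the samples above, and the word counts and temperature are arbitrary):

    # learn is the fine-tuned language-model learner from the previous sketch.
    seed_1 = "The issue in this case is whether the requested"
    seed_2 = "The patient was admitted to a"

    print(learn.predict(seed_1, n_words=80, temperature=0.75))
    print(learn.predict(seed_2, n_words=80, temperature=0.75))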
3.4 Classification with transfer learning

We further fine-tune the language models discussed in the previous subsection to train classifiers for the three datasets. Following [4, 8], we gradually unfreeze the classifier models to avoid catastrophic forgetting.

The results of evaluating the classifiers on the withheld test sets are provided in Table 5. Despite the fact that the IMR dataset contains half of the classification observations of the other two datasets, we obtain the highest level of accuracy when predicting binary Upheld/Overturned decisions based on the text of the IMR findings.

Table 5: Accuracy for transfer-learning classifiers

                          IMR      IMDB     Yelp
classification accuracy   97.12%   94.18%   96.16%
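A minimal sketch of this classification stage, again with the fastai v1 API: the classifier reuses the fine-tuned encoder and is unfrozen one layer group at a time, as in [8] (the exact learning-rate schedule is indicative only):

    from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

    # Classification data: same vocabulary as the language model, with the binary
    # Upheld/Overturned outcome as the label.
    data_clas = TextClasDataBunch.from_df(".", train_df, valid_df,
                                          text_cols="Findings",
                                          label_cols="Determination",
                                          vocab=data_lm.vocab)

    clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    clf.load_encoder("imr_enc")      # encoder saved after language-model fine-tuning

    # Gradual unfreezing: train the head, then progressively deeper layer groups,
    # to avoid catastrophic forgetting of the language-model representations.
    clf.fit_one_cycle(1, 2e-2)
    clf.freeze_to(-2)
    clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
    clf.freeze_to(-3)
    clf.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
    clf.unfreeze()
    clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))

    print(clf.validate())            # [loss, accuracy] on the validation set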
Table 6: Comparison of language models across all datasets; the IMR corpus has the best (lowest) perplexity and the best (highest) categorical accuracy

Dataset         Perplexity   Categorical accuracy
IMR reviews     11.86        0.53
Legal cases     18.17        0.43
DS Jobs         22.14        0.41
Drug reviews    25.06        0.36
Recipes         29.56        0.39
IMDB            36.96        0.39
Yelp            40.3         0.29

[Figure 6: Comparison of language-model perplexity and categorical accuracy across all the datasets]

3.5 Models for auxiliary corpora

We also estimated topic and language models for the 4 auxiliary corpora (drug reviews, DS jobs, legal cases and cooking recipes). The associations between coherence scores and number of topics for these 4 corpora were similar to the ones plotted in Figure 4 above for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best topic models had at least 14 topics, often more, with coherence scores above 0.5. The quality of the topics was also high, with intuitively coherent and contentful topics (just like IMDB / Yelp).

The perplexity and accuracy of the ULMFiT language models on previously-withheld test data are provided in Table 6, which contains the results for all the 7 datasets under consideration in this paper. We see that the predictability of the IMR corpus, as reflected in its perplexity and categorical accuracy scores, is still clearly higher than that of the 4 auxiliary corpora. The perplexity of the legal-case corpus (18.17) is somewhat close to the IMR perplexity (11.86), but we should remember that the legal-case corpus is about 5 times larger than the IMR corpus. Furthermore, the legal-case categorical accuracy of 43% is still substantially lower than the IMR accuracy of 53%. Notably, even the recipe corpus, which is about 20 times larger than the IMR corpus (≈ 117.5 vs. ≈ 5.5 million words), does not have test-set scores similar to the IMR scores.

The results for these 4 auxiliary corpora indicate that the IMR corpus is an outlier, with very highly templatic and generic texts.

4 DISCUSSION

The models discussed in the previous section show that language-model learning is significantly easier for IMRs compared to the other 6 corpora. As can be seen in Table 6, perplexity in the language model for IMR reviews is clearly lower than even legal cases, for which we expect highly templatic language and high similarity between texts. This pattern can be clearly observed in Figure 6, with the IMR corpus clearly at the high-predictability end of the spectrum.

One would not expect such highly predictable texts in an ideal scenario, where each medical review is thorough, and each decision is accompanied by strong medical reasoning relying on the specifics of the case at hand, and based on an objective physician's, or team of physicians', opinion as to what is in the patient's best interest. Arguably, these medically complex cases are as diverse as Hollywood blockbusters or fashionable restaurants – the patients themselves certainly experience them as unique and meaningful – and their reviews should be similarly diverse, or at most as templatic as a job posting or a cooking recipe. We wouldn't expect these medical reviews to be so much more predictable and generic than less socially consequential reviews of movies and restaurants.

What are the ethical and potentially legal consequences of these findings? First, while state legislators assume we have strong health-insurance related consumer protections in place, an image DMHC goes to great lengths to promote, we find the reviews to be upholding insurance plan denials at rates that exceed what one might expect, given that the treatments in question are frequently being ordered by a treating physician, and that the IMR process is the last stage in a bureaucratically laborious (hence high-attrition) process of appealing health-plan denials.

Second, given that the IMR process creates an implied relation of care between the reviewers hired by MAXIMUS and the patient – since reviewers are, after all, being entrusted with the best interests of the patient without regard to cost – one can hardly say that they are fulfilling their obligations as doctors to their patient with such seemingly rote, perfunctory reviews.

Third, if IMR processes were designed to make sure that (i) treatment decisions are being made by doctors, not by profit-driven businesses, and (ii) insurance companies cannot welch on their responsibilities to plan members, one must wonder whether prescribing physicians are wrong more than half the time. Do American doctors really order so many erroneous, medically unnecessary treatments and medications? If so, how is it possible that they are so committed and confident in them that they are willing to escalate the appeal process all the way to the state-managed IMR stage? Or is it that IMRs often serve as a final rubber stamp for health-insurance plan denials, failing their stated mission of protecting a vulnerable population?

We end this discussion section by briefly reflecting on the way we used ML/NLP methods for social-good problems in this paper. Overwhelmingly, the social-good applications of these methods and models seem to be predictive in nature: their goal is to improve the outcomes of a decision-making process, and the improvement is evaluated according to various performance-related metrics. An important class of metrics currently being developed has to do with ethical, or 'safe,' uses of ML/AI models.

In contrast, our use of ML models in this paper was analytical, with the goal of extracting insights from large datasets that enable us to empirically evaluate how well an established decision-making
process with high social impact functions. Data analysis of this kind, more akin to hypothesis testing than to predictive modeling, is in fact one of the original uses of statistical models / methods. Unfortunately, using ML models in this way does not straightforwardly lead to plots showing how ML models obviously improve metrics like the efficiency or cost of a process. We think, however, that there are as many socially beneficial opportunities for this kind of data-analysis use of ML modeling as there are for its predictive uses. The main difference between them seems to be that the data-analysis uses do not lead to more-or-less immediately measurable products. Instead, they are meant to become part of a larger argument and evaluation of a socially and politically relevant issue, e.g., the ethical status of the current health-insurance related practices and consumer protections discussed here. What counts as 'success' when ML models are deployed in this way is less immediate, but could provide at least as much social good in the long run.

5 CONCLUSION AND FUTURE WORK

We examined a database of 26,361 IMRs handled by the California DMHC through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance.

We found that, in a majority of cases, IMRs uphold the health insurance denial, despite DMHC's claim to the contrary. In addition, we analyzed the text of the reviews and compared them with a sample of 50,000 Yelp reviews and the IMDB movie review corpus. Despite the fact that these corpora are basically twice as large, we can construct a very good language model for the IMR corpus, as measured by the quality of text generation, as well as its low perplexity and high categorical accuracy on unseen test data. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which seem highly templatic and perfunctory in comparison. We see similar trends in topic models and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results were further confirmed by topic and language models for four other specialized-register corpora (drug reviews, data science job postings, legal-case reports and cooking recipes).

We are in the process of extending our datasets with (i) workers' comp cases from California and (ii) private insurance cases from other states. This will enable us to investigate whether the reviews for workers' comp cases are substantially different from the DMHC IMR data (the percentage of upheld decisions is much higher for workers' comp: ≈ 90%), as well as whether the reviews vary substantially across states.

Another direction for future work is to follow up on our preliminary qualitative research with a survey of patients who have experienced the IMR process, to see if these patients agree with the DMHC-promoted message that the IMR process provides strong consumer protection against unjustified health-plan denials. This could also enable us to verify whether the medical documentation collected during the IMR process is complete and actually taken into account when the decision is made.

The ultimate upshot of this project would be a list of recommendations for the improvement of the IMR process, including but not limited to (i) adding ways for patients to check that all the relevant documentation has been collected and will be reviewed, and (ii) identifying ways to hold the anonymous reviewers to higher standards of doctor-patient care.

ACKNOWLEDGMENTS

We are grateful to four KDD-KiML anonymous reviewers for their comments on an earlier version of this paper. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of two Titan V GPUs used for this research, as well as the UCSC Office of Research and The Humanities Institute for a matching grant to purchase additional hardware. The usual disclaimers apply.

REFERENCES

[1] Leatrice Berman-Sandler. 2004. Independent Medical Review: Expanding Legal Remedies to Achieve Managed Care Accountability. Annals of Health Law 13 (2004).
[2] Kenneth H. Chuang, Wade M. Aubry, and R. Adams Dudley. 2004. Independent Medical Review Of Health Plan Coverage Denials: Early Trends. Health Affairs 23, 6 (2004), 163–169. https://doi.org/10.1377/hlthaff.23.6.163
[3] Angus Deaton and Nancy Cartwright. 2018. Understanding and misunderstanding randomized controlled trials. Social Science and Medicine 210 (2018), 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005
[4] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1615–1625. https://doi.org/10.18653/v1/D17-1169
[5] Filippo Galgani and Achim Hoffmann. 2011. LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence, Jiuyong Li (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 445–454.
[6] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning (DH '18). Association for Computing Machinery, New York, NY, USA, 121–125. https://doi.org/10.1145/3194658.3194677
[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[8] Jeremy Howard and Sebastian Ruder. 2018. Fine-tuned Language Models for Text Classification. CoRR abs/1801.06146 (2018). arXiv:1801.06146 http://arxiv.org/abs/1801.06146
[9] Shanshan Lu. 2018. Data Scientist Job Market in the U.S. https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us. More info available at https://github.com/Silvialss/projects/tree/master/IndeedWebScraping.
[10] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 142–150.
[11] Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[12] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. CoRR abs/1708.02182 (2017).
[13] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. CoRR abs/1609.07843 (2017).
[14] Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures (WSDM '15). ACM, New York, NY, USA, 399–408. https://doi.org/10.1145/2684822.2685324
[15] Shirley Eiko Sanematsu. 2001. Taking a broader view of treatment disputes beyond managed care: Are recent legislative efforts the cure? UCLA Law Review 48 (2001).
[16] Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 464–472.
[17] Mark Steyvers and Tom Griffiths. 2007. Probabilistic Topic Models. Lawrence Erlbaum Associates.
[18] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems. 3320–3328.
[19] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. CoRR abs/1509.01626 (2015). arXiv:1509.01626 http://arxiv.org/abs/1509.01626