=Paper=
{{Paper
|id=Vol-2657/paper1
|storemode=property
|title=Textual Evidence for the Perfunctoriness of Independent Medical Reviews
|pdfUrl=https://ceur-ws.org/Vol-2657/paper1.pdf
|volume=Vol-2657
|authors=Adrian Brasoveanu,Megan Moodie,Rakshit Agrawal
|dblpUrl=https://dblp.org/rec/conf/kdd/0001MA20
}}
==Textual Evidence for the Perfunctoriness of Independent Medical Reviews==
Adrian Brasoveanu (abrsvn@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Megan Moodie (mmoodie@ucsc.edu), University of California Santa Cruz, Santa Cruz, CA
Rakshit Agrawal (ragrawal@camio.com), Camio Inc., San Mateo, CA
ABSTRACT

We examine a database of 26,361 Independent Medical Reviews (IMRs) for privately insured patients, handled by the California Department of Managed Health Care (DMHC) through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance (either private insurance or the insurance that is part of their workers' compensation; we focus on private insurance here). Laws requiring IMR were established in California and other states because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services. We analyze the text of the reviews and compare them closely with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10]. Despite the fact that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews, we can construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation, as well as low perplexity (11.86) and high categorical accuracy (0.53) on unseen test data, compared to the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy: 0.29 and 0.39). We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. We also examine four other corpora (drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]) to show that the IMR results are not typical for specialized-register corpora. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which points to the possibility that a crucial consumer protection mandated by law fails a sizeable class of highly vulnerable patients.

CCS CONCEPTS

• Computing methodologies → Latent Dirichlet allocation; Neural networks.

KEYWORDS

AI for social good, state-managed medical review processes, language models, topic models, sentiment classification

ACM Reference Format:
Adrian Brasoveanu, Megan Moodie, and Rakshit Agrawal. 2020. Textual Evidence for the Perfunctoriness of Independent Medical Reviews. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020). San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
KiML'20, August 24, 2020, San Diego, California, USA
© 2020 Copyright held by the author(s).
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

1.1 Origin and structure of IMRs

Independent Medical Review (IMR) processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance – either private insurance or the insurance that is part of their workers' compensation. In this paper, we focus exclusively on privately insured patients. Laws requiring IMR processes were established in California and other states in the late 1990s because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services to maximize profit. [Footnote 1: For California, see the Friedman-Knowles Act of 1996, requiring California health plans to provide external independent medical review (IMR) for coverage denials. As of late 2002, 41 states and the District of Columbia had passed legislation creating an IMR process. In 34 of these states, including California, the decision resulting from the IMR is binding on the health plan. See [1, 15] for summaries of the political and legal history of the IMR system, and [2] for an early partial survey of the DMHC IMR data.]

As aptly summarized in [1], IMR is regularly used to settle disputes between patients and their health insurers over what is medically necessary or experimental/investigational care. Medical necessity disputes occur between health plans and patients because the health plan disagrees with the patient's doctor about the appropriate standard of care or course of treatment for a specific condition. Under the current system of managed care in the U.S., services rendered by a health care provider are reviewed to determine whether the services are medically necessary, a process referred to as utilization review (UR). UR is the oversight mechanism through which private insurers control costs by ensuring that only medically necessary care, covered under the contractual terms of a patient's insurance plan, is provided. Services that are not deemed medically necessary or fall outside a particular plan are not covered.

Procedures or treatment protocols are deemed experimental or investigational because the health plan – but not necessarily the patient's doctor, who in many cases has enough clinical confidence in a treatment to order it – considers them non-routine medical care, or takes them to be scientifically unproven to treat the specific condition, illness, or diagnosis for which their use is proposed.

It is important to realize that the IMR process is usually the third and final stage in the medical review process. The typical progression is as follows. After in-person and possibly repeated examination of the patient, the doctor recommends a treatment,
which is then submitted for approval to the patient's health plan. If the treatment is denied in this first stage, both the doctor and the patient may file an appeal with the health plan, which triggers a second stage of reviews by the health-insurance provider, for which a patient can supply additional information and a doctor may engage in what is known as a "peer to peer" discussion with a health-insurance representative. If these second reviews uphold the initial denial, the only recourse the patient has is the state-regulated IMR process, and per California law, an IMR grievance form (and some additional information) is included with the denial letter.

An IMR review must be initiated by the patient and submitted to the California Department of Managed Health Care (DMHC), which manages IMRs for privately-insured patients. Motivated treating physicians may provide statements of support for inclusion in the documentation provided to DMHC by the patient, but in theory the IMR creates a new relationship of care between the reviewing physician(s) hired by a private contractor on behalf of DMHC, and the patient in question. The reviewing physicians' decision is supposed to be made based on what is in the best interest of the patient, not on cost concerns. It is this relation of care that constitutes the consumer protection for which IMR processes were legislated. Understandably, given that the patients in question may be ill or disabled or simply discouraged by several layers of cumbersome bureaucratic processes, there is very high attrition from the initial review to the final, IMR, stage. That is, only the few highly motivated and knowledgeable patients – or the extremely desperate – get as far as the IMR process.

The IMR process is regulated by the state, but it is actually conducted by a third party. At this time (2019), the provider in California and several other states across the US is MAXIMUS Federal Services, Inc. [Footnote 2: https://www.maximus.com/capability/appeals-imr] The costs associated with the IMR review, at least in California, are covered by health insurers. It is DMHC's and MAXIMUS's responsibility to collect all the documentation from the patient, the patient's doctor(s) and the health insurer. There are no independent checks that all the documentation has actually been collected, however, and patients do not see a final list of what has been provided to the reviewer prior to the IMR decision itself (a post facto list of file contents is mailed to patients along with the final, binding, decision; it is unclear what recourse a patient may have if they find pertinent information was missing from the review file). Once the documentation is assembled, MAXIMUS forwards it to anywhere from one to three reviewers, who remain anonymous, but are certified by MAXIMUS to be appropriately credentialed and knowledgeable about the treatment(s) and condition(s) under review. The reviewer submits a summary of the case, and also a rationale and evidence in support of their decision, which is a binary Upheld/Overturned decision about the medical service. IMR reviewers do not enter a consultative relationship with the patient, doctor or health plan – they must render an uphold/overturn decision based solely on the provided medical records. However, as noted above, they are in an implied relationship of care to the patient, a point to which we return in the Discussion section below (§4).

While insurance carriers do not provide statistics about the percentage of requested treatments that are denied in the initial stage, looking at the process as a whole, a pattern of service denial aimed at maximizing profit, rather than simply maintaining cost effectiveness, seems to emerge. Typically, the argument for denial contends that the evidence for the beneficial effects of the treatment fails the prevailing standard of scientific evidence. This prevailing standard invoked by IMR reviewers is usually randomized controlled trials (RCTs), which are expensive, time-consuming trials that are run by large pharmaceutical companies only if the treatment is ultimately estimated to be profitable.

RCTs, however, have known limits: they "require minimal assumptions and can operate with little prior knowledge [which] is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded." [3] Inflexibly applying the RCT "gold standard" in the IMR process is often a way to ignore the doctors' knowledge and experience in a way that seems superficially well-reasoned and scientific. "RCTs can play a role in building scientific knowledge and useful predictions" – and we add, treatment recommendations – "only [...] as part of a cumulative program, [in combination] with other methods." [3]

Notably, the experimental/investigational category of treatments that get denied often includes promising treatments that have not been fully tested in clinical RCTs – because the treatment is new or the condition is rare in the population, so treatment development costs might not ultimately be recovered. Another common category of experimental/investigational denials involves "off-label" drug uses, that is, uses of FDA-approved pharmaceuticals for a purpose other than the narrow one for which the drug was approved.

1.2 Main argument and predictions

Recall that these 'experimental' treatments or off-label uses are recommended by the patient's doctor, and therefore their potential benefits are taken to outweigh their possible negative effects. The recommending doctor is likely very familiar with the often lengthy, tortuous and highly specific medical history of the patient, and with the list of 'less experimental' treatments that have been proven unsuccessful or have been removed from consideration for patient-specific reasons. It is also important to remember that many rare conditions have no "on-label" treatment options available, since expensive RCTs and treatment approval processes are not undertaken if companies do not expect to recover their costs, which is likely if the potential 'market' is small (few people have the rare condition). Therefore, our main line of argumentation is as follows.

• Since IMRs are the final stage in a long bureaucratic process in which health insurance companies keep denying coverage for a treatment repeatedly recommended by a doctor as medically necessary, we expect that the issue of medical necessity is non-trivial when that specific patient and that specific treatment are carefully considered.

• We should therefore expect the text of the IMRs, which justifies the final determination, to be highly individualized and argue for that final decision (whether congruent with the health plan's decision or not) in a way that involves the particulars of the treatment and the particulars of the patient's medical history and conditions.

Thus, we expect a reasoned, thoughtful IMR to not be highly generic and templatic / predictable in nature. For instance, legal
documents may be highly templatic as they discuss the application of the same law or policy across many different cases, but a response carefully considering the specifics of a medical case reaching the IMR stage is not likely to be similar to many other cases. We only expect high similarity and 'templaticity' for IMR reviews if they are reduced to a more or less automatic application of some prespecified set of rules (rubber-stamping).

1.3 Main results, and their limits

Concomitantly with this quantitative study, we conducted preliminary qualitative research with a focus on pain management and chronic conditions. We investigated the history of the IMR process, in addition to having direct experience with it. We had detailed conversations with doctors in Northern California and on private social media groups formed around chronic conditions and pain management. This preliminary research reliably points towards the possibility that IMR reviews are perfunctory, and that this crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. In this paper, we focus on the text of the IMR decisions and attempt to quantify the evidence for the perfunctoriness of the IMR process that they provide.

The text of the IMR findings does not provide unambiguous evidence about the quality and appropriateness of the IMR process. If we had access to the full, anonymized patient files submitted to the IMR reviewers (in addition to the final IMR decision and the associated text), we might have been able to provide much stronger evidence that IMRs should have a significantly higher percentage of overturns, and that the IMR process should be improved in various ways, e.g., (i) patients should be able to check that all the relevant documentation has been collected and will be reviewed, and (ii) the anonymous reviewers should be held to higher standards of doctor-patient care. At the very least, one would want to compare the reports/letters produced by the patient's doctor(s) and the IMR texts. However, such information is not available and there are no visible signs suggesting potential availability in the near future. The information that is made available by DMHC consists of the IMR decision – whether to uphold or overturn the health plan decision – the anonymized decision letter, and information about the requested treatment category (also available in the letter). We therefore had to limit ourselves to the text of the DMHC-provided IMR findings in our empirical analysis.

A qualitative inspection of the corpus of IMR decisions made available by the California DMHC site as of June 2019 (a total of 26,361 cases spanning the years 2001-2019) indicates that the reviews – as documented in the text of the findings – focus more on the review procedure and associated legalese than on the actual medical history of the patient and the details of the case. For example, decisions for chronic pain management seem to mostly rubber-stamp the Medical Treatment Utilization Schedule (MTUS) guidelines, with very little consideration of the rarity of the underlying condition(s) (see our comments about RCTs above), or a thoughtful evaluation of the risk/benefit profile of the denied treatment relative to the specific medical history of the patient (assuming this history was adequately documented to begin with).

The goal in this paper is to investigate to what extent Natural Language Processing (NLP) / Machine Learning (ML) methods that are able to extract insights from large corpora point in the same direction, thus mitigating the cherry-picking biases that are sometimes associated with qualitative investigations. In addition to the IMR text, we perform a comparative study with additional English-language datasets in an attempt to eliminate data-specific and problem-specific biases.

• We analyze the text of the IMR reviews and compare them with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10].

• As the size of the data has significant consequences for language-model training, and NLP/ML models more generally, we expect models trained on the Yelp and IMDB corpora to outperform models trained on the IMR corpus, given that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews.

• In this paper, we instead demonstrate that we were able to construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation.

• In addition, the model achieves a much lower perplexity (11.86) and a higher categorical accuracy (0.53) on unseen test data, compared to models trained on the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; categorical accuracy: 0.29 and 0.39).

• We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews. In an attempt to mitigate confirmation bias, as well as potentially significant register differences between IMRs and movie or restaurant reviews, we examine four additional corpora: drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]. These specialized-register corpora are potentially more similar to IMRs than IMDB or Yelp: the texts are more likely to be highly similar, include boilerplate text and have a templatic/standardized structure. We find that the predictability of the IMR texts, as measured by language-model perplexity and categorical accuracy, is higher than that of all the comparison datasets by a good margin.

Based on these empirical comparisons, we conclude that we have strong evidence that the IMR reviews are perfunctory and, therefore, that a crucial consumer protection mandated by law seems to fail for a sizeable class of highly vulnerable patients. The paper is structured as follows. In Section 2, we discuss the datasets in detail, with a focus on the nature and characteristics of the IMR data. In Section 3, we discuss the models we use to analyze the IMR, Yelp and IMDB datasets, as well as the four auxiliary corpora (drug reviews, data science jobs, legal cases and recipes). The section also compares and discusses the results of these models. Section 4 puts all the results together into an argument for the perfunctoriness of the IMRs. Section 5 concludes the paper and outlines directions for future work.
2 THE DATASETS

2.1 The IMR dataset

The IMR dataset was obtained from the DMHC website in June 2019 [Footnote 3: https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinations-trend] and was minimally preprocessed. It contains 26,361 cases / observations and 14 variables, 4 of which are the most relevant:

• TreatmentCategory: the main treatment category;
• ReportYear: the year the case was reported;
• Determination: indicates if the determination was upheld or overturned;
• Findings: a summary of the case findings.

The top 14 treatment categories (with percentages of the total ≥ 2%), together with their raw counts and percentages, are provided in Table 1.

Table 1: Top 14 treatment categories

TreatmentCategory      Case count   % of total
Pharmacy                     6480          25%
Diag Imag & Screen           4187          16%
Mental Health                2599          10%
DME                          1714           7%
Gen Surg Proc                1227           5%
Orthopedic Proc              1173           5%
Rehab/ Svc - Outpt           1157           4%
Cancer Care                  1029           4%
Elect/Therm/Radfreq           828           3%
Reconstr/Plast Proc           825           3%
Autism Related Tx             767           3%
Emergency/Urg Care            582           2%
Diag/ MD Eval                 573           2%
Pain Management               527           2%

The breakdown of cases by patient gender (not recorded for all cases) is as follows: Female – 14,823 (56%), Male – 10,836 (41%), Other – 11 (0.04%).

The breakdown by determination (the outcome of the IMR) is: Upheld – 14,309 (54%), Overturned – 12,052 (46%).

The outcome counts and percentages by year are provided in Table 2. The number of cases for 2019 includes only the first 5 months of the year plus a subset of June 2019.

Table 2: Outcome counts and percentages by year

ReportYear   Total # of cases   Overturned    Upheld
2001                       28      7 (25%)        21
2002                      695    243 (35%)       452
2003                      738    280 (38%)       458
2004                      788    305 (39%)       483
2005                      959    313 (33%)       646
2006                     1080    442 (41%)       638
2007                     1342    571 (43%)       771
2008                     1521    678 (45%)       843
2009                     1432    641 (45%)       791
2010                     1453    661 (45%)       792
2011                     1435    684 (48%)       751
2012                     1203    589 (49%)       614
2013                     1197    487 (41%)       710
2014                     1433    549 (38%)       884
2015                     2079   1070 (51%)      1009
2016                     3055   1714 (56%)      1341
2017                     2953   1391 (47%)      1562
2018                     2545   1218 (48%)      1327
2019                      425    209 (49%)       216

[Figure 1: % Overturned claimed on DMHC site (June 2019)]

Interestingly, the DMHC website featured a graphic in June 2019 (Figure 1) that reports the percentage of Overturned outcomes to be 64%, a figure that does not accord with any of our data summaries. We intend to follow up on this issue and see if the DMHC can share their data-analysis pipeline so that we can pinpoint the source(s) of this difference.
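The per-year figures in Table 2 can be recomputed directly from the public DMHC export. A minimal Python/pandas sketch, assuming the CSV has been downloaded from the link in footnote 3 (the file name is arbitrary; ReportYear and Determination are the variables listed above, and the value labels "Overturned" / "Upheld" are assumed):

    import pandas as pd

    # Public DMHC IMR export (see footnote 3); the file name here is arbitrary.
    imr = pd.read_csv("dmhc_imr_determinations.csv")

    # Outcome counts by year, as in Table 2.
    by_year = pd.crosstab(imr["ReportYear"], imr["Determination"])
    by_year["Total"] = by_year.sum(axis=1)
    by_year["% Overturned"] = (100 * by_year["Overturned"] / by_year["Total"]).round(1)
    print(by_year)

    # Overall breakdown by determination (cf. the 54% Upheld / 46% Overturned split above).
    print(imr["Determination"].value_counts(normalize=True).round(2))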
Given that our main goal here is to investigate the text of the IMR findings and its predictiveness with respect to IMR outcomes, we provide some general properties of this corpus. The histogram of word counts for the IMR findings (the text associated with each case) is provided in Figure 2. There are 26,361 texts, with a total of 5,584,280 words. Words are identified by splitting texts on white space (sufficient for our purposes here). The mean length of a text is 211.84 words, with a standard deviation (SD) of 120.58.

2.2 The comparison datasets

As comparison datasets, we use the IMDB movie-review dataset [10], which has 50,000 reviews and a binary positive/negative sentiment classification associated with each review. This dataset will be particularly useful as a baseline for our ULMFiT transfer-learning language models (and subsequent transfer-learning classification models), where we show that we obtain results for the IMDB dataset that are similar to the ones in the original ULMFiT paper [8].

There are 50,000 movie reviews in the IMDB dataset, evenly split into negative and positive reviews. The histogram of text lengths for IMDB reviews is provided in Figure 2. The reviews contain a total of 11,557,297 words. The mean length of a review is 231.15 words, with an SD of 171.32.

We select a sample of 50,000 Yelp (mainly restaurant) reviews [19], with associated binarized negative/positive evaluations, to provide a comparison corpus intermediate between our DMHC dataset and the IMDB dataset. From a total of 560,000 reviews (evenly split between negative and positive), we draw a weighted random sample with the weights provided by the histogram of text lengths for the IMR corpus. The resulting sample contains 25,809 (52%) negative reviews and 24,191 (48%) positive reviews. The histogram of text lengths for Yelp reviews is also provided in Figure 2. The reviews contain a total of 7,038,467 words. The mean length of a review is 140.77 words, with an SD of 71.09.
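A minimal sketch of the length-matched Yelp sampling just described, assuming the full Yelp polarity corpus is in a dataframe `yelp` with a `text` column (the column name is ours) and reusing the `imr` dataframe from above; each Yelp review is weighted by the IMR length-histogram density at its own length:

    import numpy as np

    def n_words(s):
        # Words are identified by splitting on white space, as for the IMR texts.
        return len(s.split())

    imr_lens = imr["Findings"].map(n_words)
    yelp_lens = yelp["text"].map(n_words)

    # Density histogram of IMR text lengths; each Yelp review gets the density of
    # the IMR bin its own length falls into, so the sample mimics the IMR profile.
    density, edges = np.histogram(imr_lens, bins=50, density=True)
    bins = np.clip(np.digitize(yelp_lens, edges) - 1, 0, len(density) - 1)

    yelp_sample = yelp.sample(n=50_000, weights=density[bins], random_state=0)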
[Figure 2: Histograms of text lengths (numbers of words per text) for the (a) IMR, (b) IMDB and (c) Yelp corpora]
[Figure 3: Histograms of text lengths (numbers of words per text) for the auxiliary datasets: (a) Drug Reviews, (b) DS Jobs, (c) Legal cases, (d) Recipes]
2.3 Four auxiliary datasets

We will also analyze four other specialized-register corpora: drug reviews [6], data science (DS) job postings [9], legal case reports [5] and cooking recipes [11]. The modeling results for these specialized-register corpora will enable us to better contextualize and evaluate the modeling results for the IMR, IMDB and Yelp corpora, since these four auxiliary datasets might be seen as more similar to the IMR corpus than movie or restaurant reviews. The drug-review corpus contains reviews of pharmaceutical products, which are closer in subject matter to IMRs than movie/restaurant reviews. The other three corpora are all highly specialized in register, just like the IMRs, with two of them (DS jobs and legal cases) particularly similar to the IMRs in that they involve templatic texts containing information aimed at a specific professional sub-community.

These four corpora are very different from each other and from the IMR corpus in terms of (i) the number of texts that they contain and (ii) the average text length (number of words per text). Because of this, there was no obvious way to sample from them and from the IMR, IMDB and Yelp corpora in such a way that the resulting samples were both roughly comparable with respect to the total number of texts and average text length, and also large enough to obtain reliable model estimates. We therefore analyzed these four corpora as a whole.

The drug-review corpus includes 132,300 drug reviews – more than double the number of texts in the IMDB and Yelp datasets, and more than 4 times the number of texts in the IMR dataset. From the original corpus of 215,063 reviews, we only retained the reviews associated with a rating of 10, which we label as positive reviews, and a rating of 1 through 5, which we label as negative reviews. [Footnote 4: We did this so that we have a fairly balanced dataset (68,005 positive drug reviews and 64,295 negative reviews) to estimate classification models like the ones we report for the IMR, IMDB and Yelp corpora in the next section. For completeness, the drug-review classification results on previously unseen test data are as follows: logistic regression accuracy: 77.89%; accuracy of a multilayer perceptron with a 1,000-unit hidden layer and a ReLU non-linearity: 83.18%; ULMFiT classification model accuracy: 96.12%.]

The histogram of text lengths for drug reviews is provided in Figure 3. The reviews contain a total of 11,015,248 words, with a mean length of 83.26 words per review (significantly shorter than the IMR/IMDB/Yelp texts) and an SD of 45.73.

The DS corpus includes 6,953 job postings (about a quarter of the texts in the IMR corpus), with a total of 3,731,051 words. The histogram of text lengths is provided in Figure 3. The mean length of a job posting is 536.61 words (more than twice as long as the IMR/IMDB/Yelp texts), with an SD of 254.06.

There are 3,890 legal-case reports (even fewer than DS job postings), with a total of 25,954,650 words (about 5 times larger than the IMR corpus). The histogram of text lengths for the legal-case reports is provided in Figure 3. The mean length of a report is 6,672.15 words (an order of magnitude longer than IMR/IMDB/Yelp), with a very high SD of 11,997.98.

Finally, the recipe corpus includes more than 1 million texts: there are 1,029,719 recipes, with a total of 117,563,275 words (very large compared to our other corpora). The histogram of text lengths for the recipes is provided in Figure 3. The mean length of a recipe is 114.17 words (close to the length of a drug review, and roughly half of an IMR), with an SD of 90.54.

3 THE MODELS

In this section, we analyze the text of the IMR findings and its predictiveness with respect to IMR outcomes. We systematically compare these results with the corresponding ones for the IMDB and Yelp corpora. The datasets were split into training (80%), validation (10%) and test (10%) sets. Test sets were only used for the final model evaluation.
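A minimal sketch of this split with scikit-learn, reusing the `imr` dataframe from Section 2 (the two-step split and the fixed random seed are our choices; the same recipe applies to the comparison corpora):

    from sklearn.model_selection import train_test_split

    # 80% training, 10% validation, 10% test; the test set is only touched for
    # the final evaluation of each model.
    train_df, heldout_df = train_test_split(imr, test_size=0.2, random_state=42)
    valid_df, test_df = train_test_split(heldout_df, test_size=0.5, random_state=42)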
We start with baseline classification models (logistic regressions and logistic multilayer perceptrons with one hidden layer) to establish that the reviews in all three datasets under consideration are highly predictive of the associated binary outcomes. Once the predictiveness, hence relevance, of the text is established, we turn to an in-depth analysis of the texts themselves by means of topic and language models. We see that the text of the IMR reviews is significantly different (more predictable, less diverse / contentful) when compared to movie and restaurant reviews. We then turn to a final set of classification models that leverage transfer learning from the language models to see how predictive the texts can really be with respect to the associated binary outcomes. Finally, we report the results of estimating language models for the 4 auxiliary datasets introduced in the previous section.

The main conclusion of this extensive series of models is that the IMR corpus is an outlier, and it would be easy to make the IMR process fully automatic: it is pretty straightforward to train models that generate high-quality, realistic IMR reviews and generate binary decisions that are very reliably associated with these reviews. In contrast, movie and restaurant reviews produced by unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more human-like depth, sophistication and attention to detail, so current NLP models do not perform as well on them.

3.1 Classification models

We regress outcomes (Upheld/Overturned for IMR or negative/positive sentiment for IMDB/Yelp) against the text of the corresponding findings / reviews. For the purposes of these basic classification models, as well as the topic models discussed in the following subsection, the texts were preprocessed as follows. First, we removed stop words; for the IMR dataset, we also removed the following high-frequency words: patient, treatment, reviewer, request, medical and medically, and for the IMDB dataset, we also removed the words film and movie. After part-of-speech tagging, we retained only nouns, adjectives, verbs and adverbs, since lexical meanings provide the most useful information for logistic (more generally, feed-forward) models and topic models. The resulting dictionary for the IMR dataset had 23,188 unique words. We ensured that the dictionaries for the IMDB and Yelp datasets were also between 23,000 and 24,000 words by eliminating infrequent words. Bounding the dictionaries for each dataset to a similar range helps mitigate dataset-specific modeling biases: having differently-sized vocabularies leads to differently-sized parameter spaces for the models. We extracted features by converting each text into sparse bag-of-words vectors of dictionary length, which record how many times each token occurs in the text. These feature representations were the input to all the classifier models we consider in this subsection.

The multilayer perceptron model had a single hidden layer with 1,000 units and a ReLU non-linearity. The classification accuracies on the test data for all three datasets are provided in Table 3.
Table 3: Classification accuracy for basic models

                         IMR      IMDB     Yelp
logistic regression      90.75%   86.30%   87.62%
multilayer perceptron    90.94%   87.14%   88.92%
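A minimal sketch of this baseline pipeline, assuming spaCy for the stop-word removal and part-of-speech filtering described above and scikit-learn for the bag-of-words features and classifiers (the train/test dataframes come from the split above; hyperparameters other than the 1,000-unit ReLU hidden layer are defaults or indicative only):

    import spacy
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}
    EXTRA_STOP = {"patient", "treatment", "reviewer", "request", "medical", "medically"}

    def preprocess(text):
        # Keep only nouns, adjectives, verbs and adverbs; drop stop words and
        # the corpus-specific high-frequency words listed above.
        doc = nlp(text)
        return " ".join(t.text.lower() for t in doc
                        if t.pos_ in KEEP_POS and not t.is_stop
                        and t.text.lower() not in EXTRA_STOP)

    # Sparse bag-of-words features with a capped, comparable vocabulary size.
    vectorizer = CountVectorizer(max_features=23_000)
    X_train = vectorizer.fit_transform(train_df["Findings"].map(preprocess))
    X_test = vectorizer.transform(test_df["Findings"].map(preprocess))
    y_train, y_test = train_df["Determination"], test_df["Determination"]

    logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    mlp = MLPClassifier(hidden_layer_sizes=(1000,), activation="relu").fit(X_train, y_train)

    print("logistic regression:", accuracy_score(y_test, logreg.predict(X_test)))
    print("multilayer perceptron:", accuracy_score(y_test, mlp.predict(X_test)))

For the IMDB and Yelp datasets, the same pipeline applies with the binary sentiment label in place of the Determination outcome.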
We see that the text of the findings / reviews is highly predictive of the associated binary outcomes, with the highest accuracy for the IMR dataset despite the fact that it contains half the observations of the other two datasets. We can therefore turn to a more in-depth analysis of the texts to understand what kind of textual justification is used to motivate the IMR binary decisions. To that end, we examine and compare the results of two unsupervised/self-supervised types of models: topic models and language models.

3.2 Topic models

Topic modeling [17] is an unsupervised method that distills semantic properties of words and documents in a corpus in terms of probabilistic topics. The most widespread measure for topic-model evaluation is the coherence score [14]. Typically, as we increase the number of topics from very few, say, 4 topics, to more of them, we see an increase in coherence score that tends to level out after a certain number of topics. When modeling the IMDB and Yelp datasets, we see exactly this behavior, as shown in Figure 4.

[Figure 4: Coherence scores for topic models for the (a) IMR, (b) IMDB and (c) Yelp corpora (x-axis: number of topics; y-axis: coherence score)]

In contrast, the 4-topic model has the highest coherence score (0.56) for the IMR dataset, also shown in Figure 4. Furthermore, as we add more topics, the coherence score drops. As the word clouds for the 4-topic model in Figure 5 show, these 4 topics mostly reflect the legalese associated with the IMR review procedure and very little, if anything, of the treatments and conditions that were the main point of the review. In contrast, the corresponding high-scoring topic models for the IMDB and Yelp datasets reflect actual features of movies, e.g., family-life movies, westerns, musicals etc., or breakfast/lunch places, restaurants, shops, bars, hotels etc.

[Figure 5: Word clouds for the 4-topic IMR model]

Recall that IMRs are the legally-mandated last resort for patients seeking treatments (usually) ordered by their doctors, and which their health plan refuses to cover. The reviews are conducted exclusively based on documentation. Putting aside the fact that it is unclear how much effort is taken to ensure that the documentation is complete, especially for patients with extensive and complicated health records, we see that relatively little specific information about a patient's medical history, condition(s), or the recommended treatments is reflected in the text of these decisions. The text seems to consist largely of legalese about the IMR process, the health plan / providers, basic demographic information about the patient, and generalities about the medical service or therapy requested for the enrollee's condition.
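A minimal sketch of the coherence sweep behind Figure 4, using gensim's LDA implementation and the coherence measure of [14]; `tokenized_docs` is assumed to hold the token lists produced by the preprocessing in §3.1, and the LDA hyperparameters are indicative only:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    # tokenized_docs: one list of tokens per text, after stop-word / POS filtering.
    dictionary = Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    coherence_by_k = {}
    for k in range(4, 21, 2):
        lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                       num_topics=k, passes=10, random_state=0)
        cm = CoherenceModel(model=lda, texts=tokenized_docs,
                            dictionary=dictionary, coherence="c_v")
        coherence_by_k[k] = cm.get_coherence()

    # For the IMR corpus this curve peaks at k = 4 (coherence ~ 0.56, cf. Figure 4),
    # whereas for IMDB / Yelp the coherence keeps rising with more topics.
    print(coherence_by_k)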
3.3 Language models with transfer learning

Language models, specifically those using neural networks, are usually recurrent-network or transformer-based architectures designed to learn textual distributional patterns in an unsupervised or self-supervised manner. Recurrent-network models – on which we focus here – commonly use Long Short-Term Memory (LSTM) [7] "cells," which are able to learn long-term dependencies in sequences. Representing text as a sequence of words, language models build rich representations of the words, sentences, and their relations within a certain language. We estimate a language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8]. Just as [8], we use the AWD-LSTM model [12], a vanilla LSTM with 4 kinds of dropout regularization, an embedding size of 400, 3 LSTM layers (1,150 units per layer), and a BPTT window of size 70.

The AWD-LSTM model is pretrained on Wikitext-103 [13], consisting of 28,595 preprocessed Wikipedia articles, with a total of 103 million words. This pretrained model is fairly simple (no attention, skip connections etc.), and the pretraining corpus is of modest size.

To obtain our final language models for the IMR, IMDB and Yelp corpora, we fine-tune the pretrained AWD-LSTM model using discriminative [18] and slanted triangular [8, 16] learning rates. We do the same kind of minimal text preprocessing as in [8].

The perplexity and categorical accuracy for the 3 language models are provided in Table 4. The perplexity for the IMR findings is much lower than for the IMDB / Yelp reviews, and the language model can correctly guess the next word more than half the time.

Table 4: Language-model perplexity and categorical accuracy

                         IMR     IMDB    Yelp
perplexity               11.86   36.96   40.3
categorical accuracy     53%     39%     29%
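A minimal sketch of this fine-tuning step with the fastai v1 API released alongside [8] (it ships the Wikitext-103 AWD-LSTM weights); the column names follow the IMR dataset, while learning rates, epoch counts and the dropout multiplier are indicative only:

    import math
    from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

    # Language-model data: the Findings texts, with fastai's minimal tokenization.
    data_lm = TextLMDataBunch.from_df(".", train_df, valid_df, text_cols="Findings")

    # AWD-LSTM pretrained on Wikitext-103; fit_one_cycle gives slanted triangular
    # learning rates, and slice(...) gives discriminative per-layer rates.
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    learn.fit_one_cycle(1, 1e-2)                 # tune the newly initialized head first
    learn.unfreeze()
    learn.fit_one_cycle(10, slice(1e-4, 1e-3))   # then fine-tune the whole model

    val_loss = float(learn.validate()[0])        # mean cross-entropy on validation data
    print("perplexity:", math.exp(val_loss))     # cf. 11.86 for the IMR corpus in Table 4
    learn.save_encoder("imr_enc")                # reused by the classifier in §3.4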
The IMR language model can generate high-quality and largely coherent text, unlike the IMDB / Yelp models. Two samples of generated text are provided below (each begins with a short human-provided 'seed' text).

• The issue in this case is whether the requested partial hospitalization program ( PHP ) services are medically necessary for treatment of the patient 's behavioral health condition . The American Psychiatric Association ( APA ) treatment guidelines for patients with eating disorders also consider PHP acute care to be the most appropriate setting for treatment , and suggest that patients should be treated in the least restrictive setting which is likely to be safe and effective . The PHP was initially recommended for patients who were based on their own medical needs , but who were

• The patient was admitted to a skilled nursing facility ( SNF ) on 12 / 10 / 04 . The submitted documentation states the patient was discharged from the hospital on 12 / 22 / 04 . The following day the patient 's vital signs were stable . The patient had been ambulating to the community with assistance with transfers , but has not had any recent medical or rehabilitation therapy . The patient had no new medical problems and was discharged in stable condition . The patient has requested reimbursement for the inpatient acute rehabilitation services provided

We see that the IMR language model is highly performant, despite the simple model architecture we used, the modest size of the pretraining corpus, and the small size of the IMR corpus. The quality of the generated text is also very high, particularly given all these limitations.
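Samples like the two above can be drawn directly from the fine-tuned learner; a short sketch (the seed strings mirror the samples above, and the word counts and temperature are arbitrary):

    # learn is the fine-tuned language-model learner from the previous sketch.
    seed_1 = "The issue in this case is whether the requested"
    seed_2 = "The patient was admitted to a"

    print(learn.predict(seed_1, n_words=80, temperature=0.75))
    print(learn.predict(seed_2, n_words=80, temperature=0.75))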
3.4 Classification with transfer learning

We further fine-tune the language models discussed in the previous subsection to train classifiers for the three datasets. Following [4, 8], we gradually unfreeze the classifier models to avoid catastrophic forgetting.

The results of evaluating the classifiers on the withheld test sets are provided in Table 5. Despite the fact that the IMR dataset contains half of the classification observations of the other two datasets, we obtain the highest level of accuracy when predicting binary Upheld/Overturned decisions based on the text of the IMR findings.

Table 5: Accuracy for transfer-learning classifiers

                          IMR      IMDB     Yelp
classification accuracy   97.12%   94.18%   96.16%
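A minimal sketch of this classification stage, again with the fastai v1 API: the classifier reuses the fine-tuned encoder and is unfrozen one layer group at a time, as in [8] (the exact learning-rate schedule is indicative only):

    from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

    # Classification data: same vocabulary as the language model, with the binary
    # Upheld/Overturned outcome as the label.
    data_clas = TextClasDataBunch.from_df(".", train_df, valid_df,
                                          text_cols="Findings",
                                          label_cols="Determination",
                                          vocab=data_lm.vocab)

    clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    clf.load_encoder("imr_enc")      # encoder saved after language-model fine-tuning

    # Gradual unfreezing: train the head, then progressively deeper layer groups,
    # to avoid catastrophic forgetting of the language-model representations.
    clf.fit_one_cycle(1, 2e-2)
    clf.freeze_to(-2)
    clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
    clf.freeze_to(-3)
    clf.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3))
    clf.unfreeze()
    clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))

    print(clf.validate())            # [loss, accuracy] on the validation set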
Table 6: Comparison of language models across all datasets; the IMR corpus has the best (lowest) perplexity and the best (highest) categorical accuracy

Dataset         Perplexity   Categorical accuracy
IMR reviews     11.86        0.53
Legal cases     18.17        0.43
DS Jobs         22.14        0.41
Drug reviews    25.06        0.36
Recipes         29.56        0.39
IMDB            36.96        0.39
Yelp            40.3         0.29

[Figure 6: Comparison of language-model perplexity and categorical accuracy across all the datasets]

3.5 Models for auxiliary corpora

We also estimated topic and language models for the 4 auxiliary corpora (drug reviews, DS jobs, legal cases and cooking recipes). The associations between coherence scores and number of topics for these 4 corpora were similar to the ones plotted in Figure 4 above for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best topic models had at least 14 topics, often more, with coherence scores above 0.5. The quality of the topics was also high, with intuitively coherent and contentful topics (just like IMDB / Yelp).

The perplexity and accuracy of the ULMFiT language models on previously-withheld test data are provided in Table 6, which contains the results for all the 7 datasets under consideration in this paper. We see that the predictability of the IMR corpus, as reflected in its perplexity and categorical accuracy scores, is still clearly higher than that of the 4 auxiliary corpora. The perplexity of the legal-case corpus (18.17) is somewhat close to the IMR perplexity (11.86), but we should remember that the legal-case corpus is about 5 times larger than the IMR corpus. Furthermore, the legal-case categorical accuracy of 43% is still substantially lower than the IMR accuracy of 53%. Notably, even the recipe corpus, which is about 20 times larger than the IMR corpus (≈ 117.5 vs. ≈ 5.5 million words), does not have test-set scores similar to the IMR scores.

The results for these 4 auxiliary corpora indicate that the IMR corpus is an outlier, with very highly templatic and generic texts.

4 DISCUSSION

The models discussed in the previous section show that language-model learning is significantly easier for IMRs compared to the other 6 corpora. As can be seen in Table 6, perplexity in the language model for IMR reviews is clearly lower than even legal cases, for which we expect highly templatic language and high similarity between texts. This pattern can be clearly observed in Figure 6, with the IMR corpus clearly at the high-predictability end of the spectrum.

One would not expect such highly predictable texts in an ideal scenario, where each medical review is thorough, and each decision is accompanied by strong medical reasoning relying on the specifics of the case at hand, and based on an objective physician's, or team of physicians', opinion as to what is in the patient's best interest. Arguably, these medically complex cases are as diverse as Hollywood blockbusters or fashionable restaurants – the patients themselves certainly experience them as unique and meaningful – and their reviews should be similarly diverse, or at most as templatic as a job posting or a cooking recipe. We wouldn't expect these medical reviews to be so much more predictable and generic than less socially consequential reviews of movies and restaurants.

What are the ethical and potentially legal consequences of these findings? First, while state legislators assume we have strong health-insurance related consumer protections in place, an image DMHC goes to great lengths to promote, we find the reviews to be upholding insurance plan denials at rates that exceed what one might expect, given that the treatments in question are frequently being ordered by a treating physician, and that the IMR process is the last stage in a bureaucratically laborious (hence high-attrition) process of appealing health-plan denials.

Second, given that the IMR process creates an implied relation of care between the reviewers hired by MAXIMUS and the patient – since reviewers are, after all, being entrusted with the best interests of the patient without regard to cost – one can hardly say that they are fulfilling their obligations as doctors to their patient with such seemingly rote, perfunctory reviews.

Third, if IMR processes were designed to make sure that (i) treatment decisions are being made by doctors, not by profit-driven businesses, and (ii) insurance companies cannot welch on their responsibilities to plan members, one must wonder whether prescribing physicians are wrong more than half the time. Do American doctors really order so many erroneous, medically unnecessary treatments and medications? If so, how is it possible that they are so committed and confident in them that they are willing to escalate the appeal process all the way to the state-managed IMR stage? Or is it that IMRs often serve as a final rubber stamp for health-insurance plan denials, failing their stated mission of protecting a vulnerable population?

We end this discussion section by briefly reflecting on the way we used ML/NLP methods for social-good problems in this paper. Overwhelmingly, the social-good applications of these methods and models seem to be predictive in nature: their goal is to improve the outcomes of a decision-making process, and the improvement is evaluated according to various performance-related metrics. An important class of metrics currently being developed has to do with ethical, or 'safe,' uses of ML/AI models.

In contrast, our use of ML models in this paper was analytical, with the goal of extracting insights from large datasets that enable us to empirically evaluate how well an established decision-making
process with high social impact functions. Data analysis of this kind, more akin to hypothesis testing than to predictive modeling, is in fact one of the original uses of statistical models / methods. Unfortunately, using ML models in this way does not straightforwardly lead to plots showing how ML models obviously improve metrics like the efficiency or cost of a process. We think, however, that there are as many socially beneficial opportunities for this kind of data-analysis use of ML modeling as there are for its predictive uses. The main difference between them seems to be that the data-analysis uses do not lead to more-or-less immediately measurable products. Instead, they are meant to become part of a larger argument and evaluation of a socially and politically relevant issue, e.g., the ethical status of the current health-insurance related practices and consumer protections discussed here. What counts as 'success' when ML models are deployed in this way is less immediate, but could provide at least as much social good in the long run.

5 CONCLUSION AND FUTURE WORK

We examined a database of 26,361 IMRs handled by the California DMHC through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance.

We found that, in a majority of cases, IMRs uphold the health insurance denial, despite DMHC's claim to the contrary. In addition, we analyzed the text of the reviews and compared them with a sample of 50,000 Yelp reviews and the IMDB movie review corpus. Despite the fact that these corpora are basically twice as large, we can construct a very good language model for the IMR corpus, as measured by the quality of text generation, as well as its low perplexity and high categorical accuracy on unseen test data. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which seem highly templatic and perfunctory in comparison. We see similar trends in topic models and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews.

These results were further confirmed by topic and language models for four other specialized-register corpora (drug reviews, data science job postings, legal-case reports and cooking recipes).

We are in the process of extending our datasets with (i) workers' comp cases from California and (ii) private insurance cases from other states. This will enable us to investigate whether the reviews for workers' comp cases are substantially different from the DMHC IMR data (the percentage of upheld decisions is much higher for workers' comp: ≈ 90%), as well as whether the reviews vary substantially across states.

Another direction for future work is to follow up on our preliminary qualitative research with a survey of patients who have experienced the IMR process, to see if these patients agree with the DMHC-promoted message that the IMR process provides strong consumer protection against unjustified health-plan denials. This could also enable us to verify whether the medical documentation collected during the IMR process is complete and actually taken into account when the decision is made.

The ultimate upshot of this project would be a list of recommendations for the improvement of the IMR process, including but not limited to (i) adding ways for patients to check that all the relevant documentation has been collected and will be reviewed, and (ii) identifying ways to hold the anonymous reviewers to higher standards of doctor-patient care.

ACKNOWLEDGMENTS

We are grateful to four KDD-KiML anonymous reviewers for their comments on an earlier version of this paper. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of two Titan V GPUs used for this research, as well as the UCSC Office of Research and The Humanities Institute for a matching grant to purchase additional hardware. The usual disclaimers apply.

REFERENCES

[1] Leatrice Berman-Sandler. 2004. Independent Medical Review: Expanding Legal Remedies to Achieve Managed Care Accountability. Annals of Health Law 13 (2004).
[2] Kenneth H. Chuang, Wade M. Aubry, and R. Adams Dudley. 2004. Independent Medical Review Of Health Plan Coverage Denials: Early Trends. Health Affairs 23, 6 (2004), 163–169. https://doi.org/10.1377/hlthaff.23.6.163
[3] Angus Deaton and Nancy Cartwright. 2018. Understanding and misunderstanding randomized controlled trials. Social Science and Medicine 210 (2018), 2–21. https://doi.org/10.1016/j.socscimed.2017.12.005
[4] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1615–1625. https://doi.org/10.18653/v1/D17-1169
[5] Filippo Galgani and Achim Hoffmann. 2011. LEXA: Towards Automatic Legal Citation Classification. In AI 2010: Advances in Artificial Intelligence, Jiuyong Li (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 445–454.
[6] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning (DH '18). Association for Computing Machinery, New York, NY, USA, 121–125. https://doi.org/10.1145/3194658.3194677
[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[8] Jeremy Howard and Sebastian Ruder. 2018. Fine-tuned Language Models for Text Classification. CoRR abs/1801.06146 (2018). arXiv:1801.06146 http://arxiv.org/abs/1801.06146
[9] Shanshan Lu. 2018. Data Scientist Job Market in the U.S. https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us. More info available at https://github.com/Silvialss/projects/tree/master/IndeedWebScraping.
[10] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 142–150.
[11] Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[12] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. CoRR abs/1708.02182 (2017).
[13] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. CoRR abs/1609.07843 (2017).
[14] Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the Space of Topic Coherence Measures (WSDM '15). ACM, New York, NY, USA, 399–408. https://doi.org/10.1145/2684822.2685324
[15] Shirley Eiko Sanematsu. 2001. Taking a broader view of treatment disputes beyond managed care: Are recent legislative efforts the cure? UCLA Law Review 48 (2001).
[16] Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 464–472.
[17] Mark Steyvers and Tom Griffiths. 2007. Probabilistic Topic Models. Lawrence Erlbaum Associates.
[18] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems. 3320–3328.
[19] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. CoRR abs/1509.01626 (2015). arXiv:1509.01626 http://arxiv.org/abs/1509.01626