<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>San Diego, California, USA, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Textual Evidence for the Perfunctoriness of Independent Medical Reviews</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Brasoveanu</string-name>
          <email>abrsvn@ucsc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Megan Moodie</string-name>
          <email>mmoodie@ucsc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rakshit Agrawal</string-name>
          <email>ragrawal@camio.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Camio Inc.</institution>
          ,
          <addr-line>San Mateo, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California Santa Cruz</institution>
          ,
          <addr-line>Santa Cruz, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>24</volume>
      <issue>2020</issue>
      <abstract>
        <p>We examine a database of 26,361 Independent Medical Reviews (IMRs) for privately insured patients, handled by the California Department of Managed Health Care (DMHC) through a private contractor. IMR processes are meant to provide protection for patients whose doctors prescribe treatments that are denied by their health insurance (either private insurance or the insurance that is part of their workers' compensation; we focus on private insurance here). Laws requiring IMR were established in California and other states because patients and their doctors were concerned that health insurance plans deny coverage for medically necessary services. We analyze the text of the reviews and compare them closely with a sample of 50,000 Yelp reviews [19] and the corpus of 50,000 IMDB movie reviews [10]. Despite the fact that the IMDB corpus is twice as large as the IMR corpus, and the Yelp sample contains almost twice as many reviews, we can construct a very good language model for the IMR corpus using inductive sequential transfer learning, specifically ULMFiT [8], as measured by the quality of text generation, as well as low perplexity (11.86) and high categorical accuracy (0.53) on unseen test data, compared to the larger Yelp and IMDB corpora (perplexity: 40.3 and 37, respectively; accuracy: 0.29 and 0.39). We see similar trends in topic models [17] and classification models predicting binary IMR outcomes and binarized sentiment for Yelp and IMDB reviews. We also examine four other corpora (drug reviews [6], data science job postings [9], legal case summaries [5] and cooking recipes [11]) to show that the IMR results are not typical for specialized-register corpora. These results indicate that movie and restaurant reviews exhibit a much larger variety, more contentful discussion, and greater attention to detail compared to IMR reviews, which points to the possibility that a crucial consumer protection mandated by law fails a sizeable class of highly vulnerable patients.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Latent Dirichlet allocation;
Neural networks.</p>
    </sec>
    <sec id="sec-2">
      <title>KEYWORDS</title>
      <p>AI for social good, state-managed medical review processes,
language models, topic models, sentiment classification</p>
    </sec>
    <sec id="sec-3">
      <title>1 INTRODUCTION</title>
    </sec>
    <sec id="sec-4">
      <title>1.1 Origin and structure of IMRs</title>
      <p>Independent Medical Review (IMR) processes are meant to provide
protection for patients whose doctors prescribe treatments that are
denied by their health insurance – either private insurance or the
insurance that is part of their workers’ compensation. In this paper,
we focus exclusively on privately insured patients. Laws requiring
IMR processes were established in California and other states in
the late 1990s because patients and their doctors were concerned
that health insurance plans deny coverage for medically necessary
services to maximize profit.<sup>1</sup></p>
      <p>
        As aptly summarized in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], IMR is regularly used to settle
disputes between patients and their health insurers over what is
medically necessary or experimental/investigational care. Medical
necessity disputes occur between health plans and patients because
the health plan disagrees with the patient’s doctor about the
appropriate standard of care or course of treatment for a specific
condition. Under the current system of managed care in the U.S.,
services rendered by a health care provider are reviewed to
determine whether the services are medically necessary, a process
referred to as utilization review (UR). UR is the oversight
mechanism through which private insurers control costs by ensuring
that only medically necessary care, covered under the contractual
terms of a patient’s insurance plan, is provided. Services that are
not deemed medically necessary or fall outside a particular plan
are not covered.
      </p>
      <p>Procedures or treatment protocols are deemed experimental or
investigational because the health plan – but not necessarily the
patient’s doctor, who in many cases has enough clinical confidence
in a treatment to order it – considers them non-routine medical
care, or takes them to be scientifically unproven to treat the specific
condition, illness, or diagnosis for which their use is proposed.</p>
      <p>It is important to realize that the IMR process is usually the
third and final stage in the medical review process. The typical
progression is as follows. After in-person and possibly repeated
examination of the patient, the doctor recommends a treatment,
which is then submitted for approval to the patient’s health plan.
If the treatment is denied in this first stage, both the doctor and
the patient may file an appeal with the health plan, which triggers
a second stage of reviews by the health-insurance provider, for
which a patient can supply additional information and a doctor
may engage in what is known as a “peer to peer” discussion with a
health-insurance representative. If these second reviews uphold the
initial denial, the only recourse the patient has is the state-regulated
IMR process, and per California law, an IMR grievance form (and
some additional information) is included with the denial letter.</p>
      <p>
        <sup>1</sup>For California, see the Friedman-Knowles Act of 1996, requiring California
health plans to provide external independent medical review (IMR) for coverage
denials. As of late 2002, 41 states and the District of Columbia had passed legislation
creating an IMR process. In 34 of these states, including California, the decision
resulting from the IMR is binding on the health plan. See [
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ] for summaries of the political and legal history of the IMR system, and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for an early partial survey of the DMHC IMR data.
      </p>
      <p>An IMR review must be initiated by the patient and submitted to
the California Department of Managed Health Care (DMHC), which
manages IMRs for privately-insured patients. Motivated treating
physicians may provide statements of support for inclusion in the
documentation provided to DMHC by the patient, but in theory
the IMR creates a new relationship of care between the
reviewing physician(s) hired by a private contractor on behalf of DMHC,
and the patient in question. The reviewing physicians’ decision is
supposed to be made based on what is in the best interest of the
patient, not on cost concerns. It is this relation of care that constitutes
the consumer protection for which IMR processes were legislated.
Understandably, given that the patients in question may be ill or
disabled or simply discouraged by several layers of cumbersome
bureaucratic processes, there is a very high attrition from the initial
review to the final, IMR, stage. That is, only the few highly
motivated and knowledgeable patients – or the extremely desperate –
get as far as the IMR process.</p>
      <p>The IMR process is regulated by the state, but it is actually
conducted by a third party. At this time (2019), the provider in
California and several other states across the US is MAXIMUS Federal
Services, Inc.<sup>2</sup> The costs associated with the IMR review, at least
in California, are covered by health insurers. It is DMHC’s and
MAXIMUS’s responsibility to collect all the documentation from
the patient, the patient’s doctor(s) and the health insurer. There
are no independent checks that all the documentation has actually
been collected, however, and patients do not see a final list of what
has been provided to the reviewer prior to the IMR decision itself
(a post facto list of file contents is mailed to patients along with the
final, binding decision; it is unclear what recourse a patient may
have if they find pertinent information was missing from the review
file). Once the documentation is assembled, MAXIMUS forwards it
to anywhere from one to three reviewers, who remain anonymous,
but are certified by MAXIMUS to be appropriately credentialed
and knowledgeable about the treatment(s) and condition(s) under
review. The reviewer submits a summary of the case, and also a
rationale and evidence in support of their decision, which is a binary
Upheld/Overturned decision about the medical service. IMR
reviewers do not enter a consultative relationship with the patient, doctor
or health plan – they must render an uphold/overturn decision
based solely on the provided medical records. However, as noted
above, they are in an implied relationship of care to the patient, a
point to which we return in the Discussion section below (§4).</p>
      <p>While insurance carriers do not provide statistics about the
percentage of requested treatments that are denied in the initial
stage, looking at the process as a whole, a pattern of service denial
aimed at maximizing profit, rather than simply maintaining cost
effectiveness, seems to emerge. Typically, the argument for denial
contends that the evidence for the beneficial effects of the treatment
fails the prevailing standard of scientific evidence. This prevailing
standard invoked by IMR reviewers is usually randomized controlled
trials (RCTs), which are expensive, time-consuming trials that are
run by large pharmaceutical companies only if the treatment is
ultimately estimated to be profitable.</p>
      <p><sup>2</sup>https://www.maximus.com/capability/appeals-imr</p>
      <p>
        RCTs, however, have known limits: they “require minimal
assumptions and can operate with little prior knowledge [which] is
an advantage when persuading distrustful audiences, but it is a
disadvantage for cumulative scientific progress, where prior
knowledge should be built upon, not discarded.” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] Inflexibly applying
the RCT “gold standard” in the IMR process is often a way to
ignore the doctors’ knowledge and experience in a way that seems
superficially well-reasoned and scientific. “RCTs can play a role in
building scientific knowledge and useful predictions” – and we add,
treatment recommendations – “only [. . . ] as part of a cumulative
program, [in combination] with other methods.” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
      <p>Notably, the experimental/investigational category of treatments
that get denied often includes promising treatments that have not
been fully tested in clinical RCTs – because the treatment is new or
the condition is rare in the population, so treatment development
costs might not ultimately be recovered. Another common category
of experimental/investigational denials involves “off-label” drug
uses, that is, uses of FDA-approved pharmaceuticals for a purpose
other than the narrow one for which the drug was approved.</p>
    </sec>
    <sec id="sec-5">
      <title>1.2 Main argument and predictions</title>
      <p>Recall that these ‘experimental’ treatments or off-label uses are
recommended by the patient’s doctor, and therefore their potential
benefits are taken to outweigh their possible negative effects. The
recommending doctor is likely very familiar with the often lengthy,
tortuous and highly specific medical history of the patient, and with
the list of ‘less experimental’ treatments that have been proven
unsuccessful or have been removed from consideration for
patient-specific reasons. It is also important to remember that many rare
conditions have no “on-label” treatment options available, since
expensive RCTs and treatment approval processes are not undertaken
if companies do not expect to recover their costs, which is likely if
the potential ‘market’ is small (few people have the rare condition).</p>
      <p>Therefore, our main line of argumentation is as follows.
• Since IMRs are the final stage in a long bureaucratic process
in which health insurance companies keep denying coverage
for a treatment repeatedly recommended by a doctor as
medically necessary, we expect that the issue of medical
necessity is non-trivial when that specific patient and that
specific treatment are carefully considered.
• We should therefore expect the text of the IMRs, which
justifies the final determination, to be highly individualized and
argue for that final decision (whether congruent with the
health plan’s decision or not) in a way that involves the
particulars of the treatment and the particulars of the patient’s
medical history and conditions.</p>
      <p>Thus, we expect a reasoned, thoughtful IMR to not be highly
generic and templatic / predictable in nature. For instance, legal
documents may be highly templatic as they discuss the application
of the same law or policy across many different cases, but a response
carefully considering the specifics of a medical case reaching the
IMR stage is not likely to be similar to many other cases. We only
expect high similarity and ‘templaticity’ for IMR reviews if they are
reduced to a more or less automatic application of some prespecified
set of rules (rubber-stamping).</p>
    </sec>
    <sec id="sec-6">
      <title>1.3 Main results, and their limits</title>
      <p>Concomitantly with this quantitative study, we conducted
preliminary qualitative research with a focus on pain management and
chronic conditions. We investigated the history of the IMR process,
in addition to having direct experience with it. We had detailed
conversations with doctors in Northern California and on private
social media groups formed around chronic conditions and pain
management. This preliminary research reliably points towards the
possibility that IMR reviews are perfunctory, and that this crucial
consumer protection mandated by law seems to fail for a sizeable
class of highly vulnerable patients. In this paper, we focus on the
text of the IMR decisions and attempt to quantify the evidence for
the perfunctoriness of the IMR process that they provide.</p>
      <p>The text of the IMR findings does not provide unambiguous
evidence about the quality and appropriateness of the IMR process.
If we had access to the full, anonymized patient files submitted to
the IMR reviewers (in addition to the final IMR decision and the
associated text), we might have been able to provide much stronger
evidence that IMRs should have a significantly higher percentage of
overturns, and that the IMR process should be improved in various
ways, e.g., (i) patients should be able to check that all the relevant
documentation has been collected and will be reviewed, and (ii)
the anonymous reviewers should be held to higher standards of
doctor-patient care. At the very least, one would want to compare
the reports/letters produced by the patient’s doctor(s) and the IMR
texts. However, such information is not available and there are no
visible signs suggesting potential availability in the near future.
The information that is made available by DMHC constitutes the
IMR decision – whether to uphold or overturn the health plan
decision –, the anonymized decision letter, and information about
the requested treatment category (also available in the letter). We,
therefore, had to limit ourselves to the text of the DMHC-provided
IMR findings in our empirical analysis.</p>
      <p>A qualitative inspection of the corpus of IMR decisions made
available by the California DMHC site as of June 2019 (a total of
26,361 cases spanning the years 2001-2019) indicates that the
reviews – as documented in the text of the findings – focus more
on the review procedure and associated legalese than on the
actual medical history of the patient and the details of the case. For
example, decisions for chronic pain management seem to mostly
rubber-stamp the Medical Treatment Utilization Schedule (MTUS)
guidelines, with very little consideration of the rarity of the
underlying condition(s) (see our comments about RCTs above), or
a thoughtful evaluation of the risk/benefit profile of the denied
treatment relative to the specific medical history of the patient
(assuming this history was adequately documented to begin with).</p>
      <p>The goal in this paper is to investigate to what extent
Natural Language Processing (NLP) / Machine Learning (ML)
methods that are able to extract insights from large corpora point in
the same direction, thus mitigating cherry-picking biases that are
sometimes associated with qualitative investigations. In addition
to the IMR text, we perform a comparative study with additional
English-language datasets in an attempt to eliminate data-specific
and problem-specific biases.</p>
      <p>
        • We analyze the text of the IMR reviews and compare them
with a sample of 50,000 Yelp reviews [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and the corpus of
50,000 IMDB movie reviews [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
• As the size of data has significant consequences for
language-model training, and NLP/ML models more generally, we
expect models trained on the Yelp and IMDB corpora to
outperform models trained on the IMR corpus, given that
the IMDB corpus is twice as large as the IMR corpus, and
the Yelp samples contain almost twice as many reviews.
• In this paper, we instead demonstrate that we were able
to construct a very good language model for the IMR
corpus using inductive sequential transfer learning, specifically
ULMFiT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], as measured by the quality of text generation.
• In addition, the model achieves a much lower perplexity
(11.86) and a higher categorical accuracy (0.53) on unseen
test data, compared to models trained on the larger Yelp
and IMDB corpora (perplexity: 40.3 and 37, respectively;
categorical accuracy: 0.29 and 0.39).
• We see similar trends in topic models [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and
classification models predicting binary IMR outcomes and binarized
sentiment for Yelp and IMDB reviews.
      </p>
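The perplexity and accuracy numbers above are two sides of the same coin: perplexity is just the exponentiated mean per-token cross-entropy of the language model. The conversion can be sketched as follows (a generic formula, not specific to the ULMFiT implementation):

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """Perplexity is the exponentiated average negative log-likelihood
    (in nats) per token; lower values mean more predictable text."""
    return math.exp(total_nll / n_tokens)
```

On this view, the reported IMR perplexity of 11.86 corresponds to a mean cross-entropy of about ln(11.86) ≈ 2.47 nats per token, versus about 3.70 nats for Yelp (perplexity 40.3).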
      <p>
        These results indicate that movie and restaurant reviews
exhibit a much larger variety, more contentful discussion, and greater
attention to detail compared to IMR reviews. In an attempt to
mitigate confirmation bias, as well as potentially significant register
differences between IMRs and movie or restaurant reviews, we
examine four additional corpora: drug reviews [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], data science
job postings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], legal case summaries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and cooking recipes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
These specialized-register corpora are potentially more similar to
IMRs than IMDB or Yelp: the texts are more likely to be highly
similar, include boilerplate text and have a templatic/standardized
structure. We find that predictability of IMR texts, as measured by
language-model perplexity and categorical accuracy, is higher than
all the comparison datasets by a good margin.
      </p>
      <p>Based on these empirical comparisons, we conclude that we
have strong evidence that the IMR reviews are perfunctory and,
therefore, that a crucial consumer protection mandated by law
seems to fail for a sizeable class of highly vulnerable patients. The
paper is structured as follows. In Section 2, we discuss the datasets
in detail, with a focus on the nature and characteristics of the IMR
data. In Section 3, we discuss the models we use to analyze the IMR,
Yelp and IMDB datasets, as well as the four auxiliary corpora (drug
reviews, data science jobs, legal cases and recipes). The section also
compares and discusses the results of these models. Section 4 puts
all the results together into an argument for the perfunctoriness of
the IMRs. Section 5 concludes the paper and outlines directions for
future work.</p>
    </sec>
    <sec id="sec-7">
      <title>2 THE DATASETS</title>
    </sec>
    <sec id="sec-8">
      <title>2.1 The IMR dataset</title>
      <p>The IMR dataset was obtained from the DMHC website in June
2019<sup>3</sup> and was minimally preprocessed. It contains 26,361 cases /
observations and 14 variables, 4 of which are the most relevant:
• TreatmentCategory: the main treatment category;
• ReportYear: year the case was reported;
• Determination: indicates if the determination was upheld or
overturned;
• Findings: a summary of the case findings.</p>
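As an illustration, these four variables can be loaded and summarized with pandas (a minimal sketch; the file name is hypothetical, but the column names are as listed above):

```python
import pandas as pd

def summarize_imr(df: pd.DataFrame) -> dict:
    """Basic sanity summary of the IMR table: case count, Upheld vs.
    Overturned balance, and the share of each treatment category."""
    return {
        "n_cases": len(df),
        "outcome_counts": df["Determination"].value_counts().to_dict(),
        "category_shares": (df["TreatmentCategory"]
                            .value_counts(normalize=True)
                            .round(3)
                            .to_dict()),
    }

# Typical usage (the file name is an assumption):
# df = pd.read_csv("imr-determinations.csv")
# print(summarize_imr(df))
```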
      <p>
        The top 14 treatment categories (with percentages of total ≥ 2%),
together with their raw counts and percentages, are provided in
Table 1.
      </p>
      <p><sup>3</sup>https://data.chhs.ca.gov/dataset/independent-medical-review-imr-determinations-trend</p>
      <p>
        As our first comparison dataset, we use the IMDB movie-review dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
which has 50,000 reviews and a binary positive/negative sentiment
classification associated with each review. This dataset will be
particularly useful as a baseline for our ULMFiT transfer-learning
language models (and subsequent transfer-learning classification
models), where we show that we obtain results for the IMDB dataset
that are similar to the ones in the original ULMFiT paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      </p>
      <p>There are 50,000 movie reviews in the IMDB dataset, evenly split
into negative and positive reviews. The histogram of text lengths
for IMDB reviews is provided in Figure 2. The reviews contain a
total of 11,557,297 words. The mean length of a review is 231.15
words, with an SD of 171.32.</p>
      <p>
        We select a sample of 50,000 Yelp (mainly restaurant) reviews [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
with associated binarized negative/positive evaluations, to provide
a comparison corpus intermediate between our DMHC dataset and
the IMDB dataset. From a total of 560,000 reviews (evenly split
between negative and positive), we draw a weighted random sample
with the weights provided by the histogram of text lengths for the
IMR corpus. The resulting sample contains 25,809 (52%) negative
reviews and 24,191 (48%) positive reviews. The histogram of text
lengths for Yelp reviews is also provided in Figure 2. The reviews
contain a total of 7,038,467 words. The mean length of a review is
140.77 words, with an SD of 71.09.
      </p>
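The length-weighted sampling step described above can be sketched as follows (a simplified version: it samples with replacement and buckets lengths into fixed-width bins, both of which are assumptions about the exact procedure):

```python
import random
from collections import Counter

def length_weighted_sample(texts, target_lengths, k, bin_size=50, seed=0):
    """Draw k texts so that their word-length distribution (bucketed into
    bin_size-word bins) tracks the distribution of target_lengths, e.g.
    the IMR text lengths when sampling from the Yelp pool."""
    rng = random.Random(seed)
    # Histogram of target lengths, used as sampling weights per bin.
    target_hist = Counter(length // bin_size for length in target_lengths)
    total = sum(target_hist.values())
    weights = [target_hist[len(t.split()) // bin_size] / total for t in texts]
    return rng.choices(texts, weights=weights, k=k)
```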
    </sec>
    <sec id="sec-9">
      <title>2.3 Four auxiliary datasets</title>
      <p>
        We will also analyze four other specialized-register corpora: drug
reviews [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], data science (DS) job postings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], legal case reports [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and cooking recipes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The modeling results for these
specialized-register corpora will enable us to better contextualize and evaluate
the modeling results for the IMR, IMDB and Yelp corpora, since
these four auxiliary datasets might be seen as more similar to the
IMR corpus than movie or restaurant reviews. The drug-review
corpus contains reviews of pharmaceutical products, which are
closer in subject matter to IMRs than movie/restaurant reviews.
The other three corpora are all highly specialized in register, just
like the IMRs, with two of them (DS jobs and legal cases) particularly
similar to the IMRs in that they involve templatic texts containing
information aimed at a specific professional sub-community.
      </p>
      <p>These four corpora are very different from each other and from
the IMR corpus in terms of (i) the number of texts that they contain
and (ii) the average text length (number of words per text). Because
of this, there was no obvious way to sample from them and from
the IMR, IMDB and Yelp corpora in such a way that the resulting
samples were both roughly comparable with respect to the total
number of texts and average text length, and also large enough to
obtain reliable model estimates. We therefore analyzed these four
corpora as a whole.</p>
      <p>The drug-review corpus includes 132,300 drug reviews – more
than double the number of texts in the IMDB and Yelp datasets,
and more than 4 times the number of texts in the IMR dataset. From
the original corpus of 215,063 reviews, we only retained the reviews
associated with a rating of 10, which we label as positive reviews,
and a rating of 1 through 5, which we label as negative reviews.<sup>4</sup>
The histogram of text lengths for drug reviews is provided in
Figure 3. The reviews contain a total of 11,015,248 words, with a
mean length of 83.26 words per review (significantly shorter than
the IMR/IMDB/Yelp texts) and an SD of 45.73.</p>
      <p><sup>4</sup>We did this so that we have a fairly balanced dataset (68,005 positive
drug reviews and 64,295 negative reviews) to estimate classification models like the
ones we report for the IMR, IMDB and Yelp corpora in the next section. For
completeness, the drug-review classification results on previously unseen test data are
as follows: logistic regression accuracy: 77.89%; accuracy of a multilayer perceptron
with a 1,000-unit hidden layer and a ReLU non-linearity: 83.18%; ULMFiT
classification model accuracy: 96.12%.</p>
      <p>The DS corpus includes 6,953 job postings (about a quarter of
the texts in the IMR corpus), with a total of 3,731,051 words. The
histogram of text lengths is provided in Figure 3. The mean length
of a job posting is 536.61 words (more than twice as long as the
IMR/IMDB/Yelp texts), with an SD of 254.06.</p>
      <p>There are 3,890 legal-case reports (even fewer than DS job
postings), with a total of 25,954,650 words (about 5 times larger than
the IMR corpus). The histogram of text lengths for the legal-case
reports is provided in Figure 3. The mean length of a report is 6,672.15
words (an order of magnitude longer than IMR/IMDB/Yelp), with a
very high SD of 11,997.98.</p>
      <p>Finally, the recipe corpus includes more than 1 million texts:
there are 1,029,719 recipes, with a total of 117,563,275 words (very
large compared to our other corpora). The histogram of text lengths
for the recipes is provided in Figure 3. The mean length of a recipe
is 114.17 words (close to the length of a drug review, and roughly
half of an IMR), with an SD of 90.54.</p>
    </sec>
    <sec id="sec-10">
      <title>3 THE MODELS</title>
      <p>In this section, we analyze the text of the IMR findings and its
predictiveness with respect to IMR outcomes. We systematically
compare these results with the corresponding ones for the IMDB
and Yelp corpora. The datasets were split into training (80%),
validation (10%) and test (10%) sets. Test sets were only used for the
final model evaluation.</p>
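The 80/10/10 split used throughout this section can be sketched as:

```python
import random

def train_valid_test_split(items, seed=0):
    """Shuffle and split a corpus into 80% training, 10% validation and
    10% test portions, as used for all datasets in this section."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```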
      <p>We start with baseline classification models (logistic regressions
and logistic multilayer perceptrons with one hidden layer) to
establish that the reviews in all three datasets under consideration
are highly predictive of the associated binary outcomes. Once the
predictiveness, and hence relevance, of the text is established, we
turn to an in-depth analysis of the texts themselves by means of
topic and language models. We see that the text of the IMR reviews
is significantly different (more predictable, less diverse / contentful)
when compared to movie and restaurant reviews. We then turn to
a final set of classification models that leverage transfer learning
from the language models to see how predictive the texts can
really be with respect to the associated binary outcomes. Finally, we
report the results of estimating language models for the 4 auxiliary
datasets introduced in the previous section.</p>
      <p>The main conclusion of this extensive series of models is that
the IMR corpus is an outlier, and it would be easy to make the
IMR process fully automatic: it is pretty straightforward to train
models that generate high-quality, realistic IMR reviews and
generate binary decisions that are very reliably associated with these
reviews. In contrast, movie and restaurant reviews produced by
unpaid volunteers (as well as the 4 auxiliary datasets) exhibit more
human-like depth, sophistication and attention to detail, so current
NLP models do not perform as well on them.</p>
    </sec>
    <sec id="sec-11">
      <title>3.1 Classification models</title>
      <p>We regress outcomes (Upheld/Overturned for IMR or
negative/positive sentiment for IMDB/Yelp) against the text of the
corresponding findings / reviews. For the purposes of these basic
classification models, as well as the topic models discussed in the
following subsection, the texts were preprocessed as follows. First,
we removed stop words; for the IMR dataset, we also removed the
following high-frequency words: patient, treatment, reviewer,
request, medical and medically, and for the IMDB dataset, we also
removed the words film and movie. After part-of-speech tagging,
we retained only nouns, adjectives, verbs and adverbs, since lexical
meanings provide the most useful information for logistic (more
generally, feed-forward) models and topic models. The resulting
dictionary for the IMR dataset had 23,188 unique words. We ensured
that the dictionaries for the IMDB and Yelp datasets were also
between 23,000 and 24,000 words by eliminating infrequent words.
Bounding the dictionaries for each dataset to a similar range helps
mitigate dataset-specific modeling biases: having differently-sized
vocabularies leads to differently-sized parameter spaces for the
models.</p>
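A minimal sketch of this preprocessing step (the stop-word list here is a toy subset, and the part-of-speech filtering, which requires a tagger, is omitted):

```python
import re

# Toy stop-word list for illustration; the actual pipeline used a full
# stop-word list plus a part-of-speech tagger to keep only nouns,
# adjectives, verbs and adverbs.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "is", "was", "in"}
# Domain-specific high-frequency words removed from the IMR texts:
IMR_EXTRA_STOPS = {"patient", "treatment", "reviewer", "request",
                   "medical", "medically"}

def preprocess(text, extra_stops=frozenset()):
    """Lowercase, tokenize, and drop stop words plus any corpus-specific
    high-frequency words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    drop = STOP_WORDS | set(extra_stops)
    return [t for t in tokens if t not in drop]
```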
      <p>We extracted features by converting each text into sparse
bag-of-words vectors of dictionary length, which recorded how many times
each token occurred in the text. These feature representations were
the input to both classifier models we consider in this subsection:
a logistic regression and a multilayer perceptron. The multilayer
perceptron model had a single hidden layer with
1,000 units and a ReLU non-linearity. The classification accuracies
on the test data for all three datasets are provided in Table 3.</p>
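      <p>As a minimal illustration of this feature-extraction step, the sketch below builds the count-based bag-of-words vector for one text; the function name is ours, not from the paper’s code.</p>

```python
def bow_vector(tokens, vocab):
    """Bag-of-words features: one count per dictionary word. In practice
    these vectors are stored sparsely, since most of the ~23,000
    dictionary words are absent from any one text."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0] * len(vocab)
    for t in tokens:
        if t in index:          # out-of-dictionary tokens are ignored
            vec[index[t]] += 1
    return vec
```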
      <p>We see that the text of the findings / reviews is highly predictive
of the associated binary outcomes, with the highest accuracy for the
IMR dataset despite the fact that it contains half the observations
of the other two datasets. We can therefore turn to a more
in-depth analysis of the texts to understand what kind of textual
justification is used to motivate the IMR binary decisions. To that
end, we examine and compare the results of two
unsupervised/self-supervised types of models: topic models and language models.</p>
    </sec>
    <sec id="sec-12">
      <title>Topic models</title>
      <p>
        Topic modeling [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is an unsupervised method that distills
semantic properties of words and documents in a corpus in terms of
probabilistic topics. The most widespread measure for topic model
evaluation is the coherence score [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Typically, as we increase
the number of topics from very few, say, 4 topics, to more of them,
we see an increase in coherence score that tends to level out after
a certain number of topics. When modeling the IMDB and Yelp
datasets, we see exactly this behavior, as shown in Figure 4.
      </p>
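      <p>Coherence measures come in several variants [17]; as a concrete illustration, below is a minimal sketch of the UMass-style score, which rates a topic’s top words by how often they co-occur in the same documents. This is our own illustrative implementation, not necessarily the variant used in the experiments.</p>

```python
import math

def umass_coherence(topic_words, documents):
    """UMass coherence: mean log of smoothed co-document frequencies over
    ordered pairs of a topic's top words. Higher = more coherent."""
    doc_sets = [set(d) for d in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in ds for w in words) for ds in doc_sets)

    score, pairs = 0.0, 0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
            pairs += 1
    return score / pairs
```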
      <p>In contrast, the 4-topic model has the highest coherence score
(0.56) for the IMR data set, also shown in Figure 4. Furthermore,
as we add more topics, the coherence score drops. As the word
clouds for the 4-topic model in Figure 5 show, these 4 topics mostly
reflect the legalese associated with the IMR review procedure and
very little, if anything, of the treatments and conditions that were
the main point of the review. In contrast, the corresponding
high-scoring topic models for the IMDB and Yelp datasets reflect actual
features of movies, e.g., family-life movies, westerns, musicals etc.,
or breakfast/lunch places, restaurants, shops, bars, hotels etc.</p>
      <p>Recall that IMRs are the legally-mandated last resort for patients
seeking treatments (usually) ordered by their doctors, and which
their health plan refuses to cover. The reviews are conducted
exclusively based on documentation. Putting aside the fact that it is
unclear how much effort is taken to ensure that the documentation
is complete, especially for patients with extensive and complicated
health records, we see that relatively little specific information
about a patient’s medical history, condition(s), or the recommended
treatments is reflected in the text of these decisions. The text seems
to consist largely of legalese about the IMR process, the health plan
/ providers, basic demographic information about the patient, and
generalities about the medical service or therapy requested for the
enrollee’s condition.</p>
    </sec>
    <sec id="sec-13">
      <title>Language models with transfer learning</title>
      <p>
        Language models, specifically using neural networks, are usually
recurrent-network or transformer-based architectures designed
to learn textual distributional patterns in an unsupervised or
self-supervised manner. Recurrent-network models – on which we
focus here – commonly use Long Short-Term Memory (LSTM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
“cells,” which are able to learn long-term dependencies in sequences.
Representing text as a sequence of words, language models build
rich representations of the words, sentences, and their relations
within a certain language. We estimate a language model for the
IMR corpus using inductive sequential transfer learning, specifically
ULMFiT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we use the AWD-LSTM model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a vanilla
LSTM with 4 kinds of dropout regularization, an embedding size of
400, 3 LSTM layers (1,150 units per layer), and a BPTT window of size 70.
      </p>
      <p>
        The AWD-LSTM model is pretrained on Wikitext-103 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
consisting of 28,595 preprocessed Wikipedia articles, with a total of 103
million words. This pretrained model is fairly simple (no attention,
skip connections etc.), and the pretraining corpus is of modest size.
      </p>
      <p>
        To obtain our final language models for the IMR, IMDB and
Yelp corpora, we fine-tune the pretrained AWD-LSTM model using
discriminative [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and slanted triangular [
        <xref ref-type="bibr" rid="ref16 ref8">8, 16</xref>
        ] learning rates. We
do the same kind of minimal text preprocessing as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
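      <p>The slanted triangular schedule of [8, 16] increases the learning rate linearly over a short warm-up and then decays it linearly over the rest of training. A minimal sketch, with ULMFiT’s default hyperparameters assumed here for illustration:</p>

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01,
                          cut_frac=0.1, ratio=32):
    """Learning rate at step t: linear warm-up to lr_max during the first
    cut_frac of training, then linear decay down to lr_max / ratio."""
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

      <p>Discriminative fine-tuning [18] then assigns smaller rates to lower layers; in [8], each successive lower layer’s rate is the layer above it divided by 2.6.</p>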
      <p>The perplexity and categorical accuracy for the 3 language
models are provided in Table 4. The perplexity for the IMR findings is
much lower than for the IMDB / Yelp reviews, and the language
model can correctly guess the next word more than half the time.</p>
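      <p>Both evaluation metrics follow directly from the model’s next-word distributions: perplexity is the exponentiated mean negative log-likelihood of the actual next words, and categorical accuracy is the fraction of positions where the most probable word is the actual one. A small self-contained sketch (our own, for illustration):</p>

```python
import math

def perplexity_and_accuracy(prob_dists, targets):
    """prob_dists: per-position dicts mapping candidate words to model
    probabilities; targets: the words that actually occurred next."""
    nll = 0.0
    correct = 0
    for dist, target in zip(prob_dists, targets):
        nll += -math.log(dist[target])              # negative log-likelihood
        correct += max(dist, key=dist.get) == target  # argmax == actual word
    n = len(targets)
    return math.exp(nll / n), correct / n
```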
      <p>The IMR language model can generate high quality and largely
coherent text, unlike the IMDB / Yelp models. Two samples of
generated text are provided below (the ‘seed’ text is boldfaced).
• The issue in this case is whether the requested partial
hospitalization program ( PHP ) services are medically necessary
for treatment of the patient ’s behavioral health condition
. The American Psychiatric Association ( APA ) treatment
guidelines for patients with eating disorders also consider
PHP acute care to be the most appropriate setting for
treatment , and suggest that patients should be treated in the least
restrictive setting which is likely to be safe and effective .
The PHP was initially recommended for patients who were
based on their own medical needs , but who were
• The patient was admitted to a skilled nursing facility (
SNF ) on 12 / 10 / 04 . The submitted documentation states
the patient was discharged from the hospital on 12 / 22 /
04 . The following day the patient ’s vital signs were
stable . The patient had been ambulating to the community
with assistance with transfers , but has not had any recent
medical or rehabilitation therapy . The patient had no new
medical problems and was discharged in stable condition .
The patient has requested reimbursement for the inpatient
acute rehabilitation services provided</p>
      <p>We see that the IMR language model is highly performant,
despite the simple model architecture we used, the modest size of
the pretraining corpus, and the small size of the IMR corpus. The
quality of the generated text is also very high, particularly given
all these limitations.
</p>
    </sec>
    <sec id="sec-14">
      <title>Classification with transfer learning</title>
      <p>
        We further fine-tune the language models discussed in the previous
subsection to train classifiers for the three datasets. Following [
        <xref ref-type="bibr" rid="ref4 ref8">4, 8</xref>
        ],
we gradually unfreeze the classifier models to avoid catastrophic
forgetting.
      </p>
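      <p>Gradual unfreezing fine-tunes the classifier top-down, making one more layer group trainable per epoch so that the pretrained lower layers are not overwritten early. A schematic sketch (the layer dicts here are stand-ins for real model parameter groups):</p>

```python
def gradual_unfreeze(layers, epoch):
    """At epoch e, make the last e+1 layer groups trainable and keep the
    earlier (more general, pretrained) groups frozen."""
    n = len(layers)
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= n - 1 - epoch
    return layers
```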
      <p>The results of evaluating the classifiers on the withheld test
sets are provided in Table 5. Despite the fact that the IMR dataset
contains half of the classification observations of the other two
datasets, we obtain the highest level of accuracy when predicting
binary Upheld/Overturned decisions based on the text of the IMR
findings.</p>
      <p>We also estimated topic and language models for the 4 auxiliary
corpora (drug reviews, DS jobs, legal cases and cooking recipes).
The associations between coherence scores and numbers of topics
for these 4 corpora were similar to the ones plotted in Figure 4 above
for the IMDB and Yelp corpora. For all 4 auxiliary corpora, the best
topic models had at least 14 topics, often more, with coherence
scores above 0.5. The quality of the topics was also high, with
intuitively coherent and contentful topics (just like IMDB / Yelp).</p>
      <p>The perplexity and accuracy of the ULMFiT language models
on previously-withheld test data are provided in Table 6, which
contains the results for all the 7 datasets under consideration in
this paper. We see that the predictability of the IMR corpus, as
reflected in its perplexity and categorical accuracy scores, is still
clearly higher than the 4 auxiliary corpora. The perplexity of the
legal-case corpus (18.17) is somewhat close to the IMR perplexity
(11.86), but we should remember that the legal-case corpus is about
5 times larger than the IMR corpus. Furthermore, the legal-case
categorical accuracy of 43% is still substantially lower than the IMR
accuracy of 53%. Notably, even the recipe corpus, which is about 20
times larger than the IMR corpus (≈ 117.5 vs. ≈ 5.5 million words)
does not have test-set scores similar to the IMR scores.</p>
      <p>The results for these 4 auxiliary corpora indicate that the IMR
corpus is an outlier, with highly templatic and generic texts.</p>
    </sec>
    <sec id="sec-15">
      <title>DISCUSSION</title>
      <p>The models discussed in the previous section show that
language-model learning is significantly easier for IMRs compared to the other
6 corpora. As can be seen in Table 6, perplexity in the language
model for IMR reviews is clearly lower than even legal cases, for
which we expect highly templatic language and high similarity
between texts. This pattern can be clearly observed in Figure 6,
with the IMR corpus clearly at the very end of the high-to-low
predictability spectrum.</p>
      <p>One would not expect such highly predictable texts in an ideal
scenario, where each medical review is thorough, and each
decision is accompanied by strong medical reasoning relying on the
specifics of the case at hand, and based on an objective physician’s,
or team of physicians’, opinion as to what is in the patient’s best
interest. Arguably, these medically complex cases are as diverse as
Hollywood blockbusters or fashionable restaurants – the patients
themselves certainly experience them as unique and meaningful
–, and their reviews should be similarly diverse, or at most as
templatic as a job posting or a cooking recipe. We wouldn’t expect
these medical reviews to be so much more predictable and generic
than less socially consequential reviews of movies and restaurants.</p>
      <p>What are the ethical and potentially legal consequences of these
findings? First, while state legislators assume we have strong
health-insurance related consumer protections in place, an image DMHC
goes to great lengths to promote, we find the reviews to be
upholding insurance plan denials at rates that exceed what one might
expect, given that the treatments in question are frequently being
ordered by a treating physician, and that the IMR process is the last
stage in a bureaucratically laborious (hence high-attrition) process
of appealing health-plan denials.</p>
      <p>Second, given that the IMR process creates an implied relation
of care between the reviewers hired by MAXIMUS and the patient –
since reviewers are, after all, being entrusted with the best interests
of the patient without regard to cost –, one can hardly say that they
are fulfilling their obligations as doctors to their patient with such
seemingly rote, perfunctory reviews.</p>
      <p>Third, if IMR processes were designed to make sure that (i)
treatment decisions are being made by doctors, not by profit-driven
businesses, and (ii) insurance companies cannot welch on their
responsibilities to plan members, one must wonder whether
prescribing physicians are wrong more than half the time. Do American
doctors really order so many erroneous, medically unnecessary
treatments and medications? If so, how is it possible that they are
so committed and confident in them that they are willing to escalate
the appeal process all the way to the state-managed IMR stage?
Or is it that IMRs often serve as a final rubber stamp for
health-insurance plan denials, failing their stated mission of protecting a
vulnerable population?</p>
      <p>We end this discussion section by briefly reflecting on the way
we used ML/NLP methods for social good problems in this paper.
Overwhelmingly, the social-good applications of these methods
and models seem to be predictive in nature: their goal is to improve
the outcomes of a decision-making process, and the improvement
is evaluated according to various performance-related metrics. An
important class of metrics that are currently being developed have
to do with ethical, or ‘safe,’ uses of ML/AI models.</p>
      <p>In contrast, our use of ML models in this paper was analytical,
with the goal of extracting insights from large datasets that enable
us to empirically evaluate how well an established decision-making
process with high social impact functions. Data analysis of this
kind, more akin to hypothesis testing than to predictive modeling,
is in fact one of the original uses of statistical models / methods.</p>
      <p>Unfortunately, using ML models in this way does not straightforwardly
lead to plots showing how ML models obviously improve
metrics like the efficiency or cost of a process. We think, however,
that there are as many socially beneficial opportunities for this kind
of data-analysis use of ML modeling as there are for its predictive
uses. The main difference between them seems to be that the
data-analysis uses do not lead to more-or-less immediately measurable
products. Instead, they are meant to become part of a larger
argument and evaluation of a socially and politically relevant issue,
e.g., the ethical status of current health-insurance related practices
and consumer protections discussed here. What counts as ‘success’
when ML models are deployed in this way is less immediate, but
could provide at least as much social good in the long run.</p>
    </sec>
    <sec id="sec-16">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>We examined a database of 26,361 IMRs handled by the California
DMHC through a private contractor. IMR processes are meant to
provide protection for patients whose doctors prescribe treatments
that are denied by their health insurance.</p>
      <p>We found that, in a majority of cases, IMRs uphold the health
insurance denial, despite DMHC’s claim to the contrary. In addition,
we analyzed the text of the reviews and compared them with a
sample of 50,000 Yelp reviews and the IMDB movie review corpus.
Despite the fact that these corpora are basically twice as large, we
can construct a very good language model for the IMR corpus,
as measured by the quality of text generation, as well as its low
perplexity and high categorical accuracy on unseen test data. These
results indicate that movie and restaurant reviews exhibit a much
larger variety, more contentful discussion, and greater attention
to detail compared to IMR reviews, which seem highly templatic
and perfunctory in comparison. We see similar trends in topic
models and classification models predicting binary IMR outcomes
and binarized sentiment for Yelp and IMDB reviews.</p>
      <p>These results were further confirmed by topic and language
models for four other specialized-register corpora (drug reviews,
data science job postings, legal-case reports and cooking recipes).</p>
      <p>We are in the process of extending our datasets with (i) workers’
comp cases from California and (ii) private insurance cases from
other states. This will enable us to investigate if the reviews for
workers’ comp cases are substantially different from the DMHC
IMR data (the percentage of upheld decisions is much higher for
workers’ comp: ≈ 90%), as well as if the reviews vary substantially
across states.</p>
      <p>Another direction for future work is to follow up on our
preliminary qualitative research with a survey of patients who have
experienced the IMR process to see if these patients agree with the
DMHC-promoted message that the IMR process provides strong
consumer protection against unjustified health-plan denials. This
could also enable us to verify if the medical documentation
collected during the IMR process is complete and actually taken into
account when the decision is made.</p>
      <p>The ultimate upshot of this project would be a list of
recommendations for the improvement of the IMR process, including but not
limited to (i) adding ways for patients to check that all the
relevant documentation has been collected and will be reviewed, and
(ii) identifying ways to hold the anonymous reviewers to higher
standards of doctor-patient care.</p>
    </sec>
    <sec id="sec-17">
      <title>ACKNOWLEDGMENTS</title>
      <p>We are grateful to four KDD-KiML anonymous reviewers for their
comments on an earlier version of this paper. We gratefully
acknowledge the support of the NVIDIA Corporation with the donation of
two Titan V GPUs used for this research, as well as the UCSC Office
of Research and The Humanities Institute for a matching grant to
purchase additional hardware. The usual disclaimers apply.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Leatrice</given-names>
            <surname>Berman-Sandler</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Independent Medical Review: Expanding Legal Remedies to Achieve Managed Care Accountability</article-title>
          .
          <source>Annals Health Law</source>
          <volume>13</volume>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kenneth H.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wade M.</given-names>
            <surname>Aubry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. Adams</given-names>
            <surname>Dudley</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Independent Medical Review Of Health Plan Coverage Denials: Early Trends</article-title>
          .
          <source>Health Affairs 23</source>
          ,
          <issue>6</issue>
          (
          <year>2004</year>
          ),
          <fpage>163</fpage>
          -
          <lpage>169</lpage>
          . https://doi.org/10.1377/hlthaff.23.6.
          <fpage>163</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Angus</given-names>
            <surname>Deaton</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nancy</given-names>
            <surname>Cartwright</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Understanding and misunderstanding randomized controlled trials</article-title>
          .
          <source>Social Science and Medicine</source>
          <volume>210</volume>
          (
          <year>2018</year>
          ),
          <fpage>2</fpage>
          -
          <lpage>21</lpage>
          . https://doi.org/10.1016/j.socscimed.
          <year>2017</year>
          .
          <volume>12</volume>
          .005
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Bjarke</given-names>
            <surname>Felbo</surname>
          </string-name>
          , Alan Mislove, Anders Søgaard, Iyad Rahwan, and
          <string-name>
            <given-names>Sune</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          , Copenhagen, Denmark,
          <fpage>1615</fpage>
          -
          <lpage>1625</lpage>
          . https://doi.org/10.18653/v1/
          <fpage>D17</fpage>
          - 1169
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Filippo</given-names>
            <surname>Galgani</surname>
          </string-name>
          and
          <string-name>
            <given-names>Achim</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>LEXA: Towards Automatic Legal Citation Classification</article-title>
          .
          <source>In AI 2010: Advances in Artificial Intelligence</source>
          , Jiuyong Li (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg,
          <fpage>445</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Felix</given-names>
            <surname>Gräßer</surname>
          </string-name>
          , Surya Kallumadi, Hagen Malberg, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Zaunseder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning (DH '18). Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <fpage>121</fpage>
          -
          <lpage>125</lpage>
          . https://doi.org/10.1145/3194658.3194677
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Comput. 9</source>
          ,
          <issue>8</issue>
          (Nov.
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . https://doi.org/10.1162/neco.
          <year>1997</year>
          .
          <volume>9</volume>
          . 8.
          <fpage>1735</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Fine-tuned Language Models for Text Classification</article-title>
          . CoRR abs/
          <year>1801</year>
          .06146 (
          <year>2018</year>
          ). arXiv:
          <year>1801</year>
          .06146 http://arxiv.org/ abs/
          <year>1801</year>
          .06146
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Shanshan</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Data Scientist Job Market in the U.S.</article-title>
          . https://www.kaggle.com/sl6149/data-scientist-job-market-in-the-us More info available here: https://github.com/Silvialss/projects/tree/master/IndeedWebScraping.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Andrew L.</given-names>
            <surname>Maas</surname>
          </string-name>
          , Raymond E. Daly, Peter T. Pham, Dan Huang,
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Potts</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Learning Word Vectors for Sentiment Analysis (HLT '11). Association for Computational Linguistics</article-title>
          , Stroudsburg, PA, USA,
          <fpage>142</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Javier</given-names>
            <surname>Marin</surname>
          </string-name>
          , Aritro Biswas, Ferda Ofli, Nicholas Hynes,
          <string-name>
            <given-names>Amaia</given-names>
            <surname>Salvador</surname>
          </string-name>
          , Yusuf Aytar, Ingmar Weber, and Antonio Torralba.
          <year>2019</year>
          .
          <article-title>Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell. (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          , Nitish Shirish Keskar, and Richard Socher.
          <year>2017</year>
          .
          <article-title>Regularizing and Optimizing LSTM Language Models</article-title>
          .
          <source>CoRR abs/1708</source>
          .02182 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          , Caiming Xiong, James Bradbury, and Richard Socher.
          <year>2017</year>
          .
          <article-title>Pointer Sentinel Mixture Models</article-title>
          .
          <source>CoRR abs/1609</source>
          .07843 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Röder</surname>
          </string-name>
          , Andreas Both, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Exploring the Space of Topic Coherence Measures (WSDM '15)</article-title>
          . ACM, New York, NY, USA,
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          . https://doi.org/10.1145/2684822.2685324
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Shirley Eiko</given-names>
            <surname>Sanematsu</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Taking a broader view of treatment disputes beyond managed care: Are recent legislative efforts the cure</article-title>
          ?
          <source>UCLA Law Review</source>
          <volume>48</volume>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Leslie N.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Cyclical learning rates for training neural networks</article-title>
          .
          <source>In Applications of Computer Vision (WACV)</source>
          ,
          <source>2017 IEEE Winter Conference on. IEEE</source>
          .
          <volume>464</volume>
          -
          <fpage>472</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Steyvers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tom</given-names>
            <surname>Griffiths</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Probabilistic Topic Models</article-title>
          . Lawrence Erlbaum Associates.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Jason</given-names>
            <surname>Yosinski</surname>
          </string-name>
          , Jeff Clune, Yoshua Bengio, and
          <string-name>
            <given-names>Hod</given-names>
            <surname>Lipson</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>How transferable are features in deep neural networks?</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <volume>3320</volume>
          -
          <fpage>3328</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Junbo Jake Zhao, and Yann LeCun
          .
          <year>2015</year>
          .
          <article-title>Character-level Convolutional Networks for Text Classification</article-title>
          .
          <source>CoRR abs/1509</source>
          .01626 (
          <year>2015</year>
          ). arXiv:
          <volume>1509</volume>
          .01626 http://arxiv.org/abs/1509.01626
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>