<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Bicluster phenotyping of healthcare providers, procedures, and prescriptions with deep learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nathanael Fillmore</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ansh K. Mehta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeremy C. Weiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University, Heinz College</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Department of Veterans Affairs &amp; Dana-Farber Cancer Institute</institution>
          ,
          <addr-line>Boston, MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of South Alabama Hospitals</institution>
          ,
          <addr-line>Mobile, AL</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>30</volume>
      <issue>2018</issue>
      <abstract>
<p>We consider the task of biclustering healthcare providers (~10<sup>6</sup>), procedures (~10<sup>4</sup>), and drugs (~10<sup>3</sup>) based on 2015 counts of claims by provider. Many clustering methods handle data at this scale poorly, and in sparse count data the appropriate notion of "distance" is unclear. We use neural networks, which are frequently used for representation learning, to represent the sparse high-dimensional information with a low-dimensional embedding. While previous work has applied standard clustering methods in the embedding space, we instead learn a low-dimensional embedding with geometric constraints that enables simultaneous image recovery and cluster determination.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>We consider the task of biclustering healthcare providers (~10<sup>6</sup>),
procedures (~10<sup>4</sup>), and drugs (~10<sup>3</sup>). Standard ontologies exist that group medical
providers, procedures, and prescriptions into categories, but these categories do
not always correspond to clusters that are relevant for every question of interest,
and some standard categories, like Internal Medicine, are quite broad. Instead,
we would like a tool that captures practice characteristics de novo to compare
against collected measures and ontologies. This has several applications: we can
measure concordance of the existing hierarchies with those identified in data,
we can look at changes in practice patterns over time, we can investigate
differences in practice within and across nominal subgroups based on practice and
utilization, and we can investigate co-occurrence.</p>
<p>Many clustering methods perform poorly on this task because of the scale
of the data and because, in sparse count data, the most appropriate notion of
"distance" is not entirely clear. Additionally, centroid-based clustering in our
sparse count data space is problematic because identifying an appropriate
centroid location is non-obvious. Using the (possibly weighted) average value of
members assumes an L2 space, which is problematic for high-dimensional, sparse
count data. Moreover, it is well known that pairwise distances tend to become
nearly indistinguishable in high dimensions, and this problem is only made worse
by sparse count data.</p>
      <p>In this work, we describe a method that learns a low-dimensional
embedding with geometric constraints that enables simultaneous image recovery and
cluster determination. We propose a generic architecture that simultaneously
(1) optimizes image recovery in an autoencoding framework, (2) creates a
low-dimensional embedded representation of the high-dimensional space, (3) assigns
examples to clusters in a soft clustering, and (4) optimizes the quality of this
clustering.</p>
<p>Several authors have recently proposed methods that are conceptually
similar to our contribution. For example, [<xref ref-type="bibr" rid="ref5">5</xref>] also formulates a method that
integrates a clustering module in a deep learning framework, as does
[<xref ref-type="bibr" rid="ref4">4</xref>]. However, both of these methods are
based on identifying centroids in the latent space, which is not appropriate
for our data. Other related work includes methods to learn metric embeddings
with neural networks, e.g., [<xref ref-type="bibr" rid="ref2">2</xref>], but as
these methods do not incorporate a clustering objective in the neural network,
the learned embedding is not necessarily suitable for clustering.</p>
      <p>[Fig. 1. Autoencoder with clustering module. Nodes are Input (I), Output (O),
Hidden (H), and Cluster (C). H (h∠) and O, and H (h∠) and C, are constrained by
batch cosine similarity.]</p>
      <p>In our application, cosine similarity is appropriate, so we use it in our
examples. Many healthcare datasets consist of sparse count data in
high-dimensional spaces as a consequence of counting visits, billing codes,
procedure codes, drug prescriptions, etc. Our tool is well-suited to data of
this nature.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
<p>Consider a sparse count matrix with N rows and K columns. Define M as
the N × K matrix whose elements are log-transformed using the function
f(x) = log(1 + x). We are interested in clustering over rows (providers) and
columns (prescriptions and procedures). We will consider clustering across rows
and columns separately. For each case, we construct an autoencoder consisting
of an encoder φ<sub>1</sub> and a decoder φ<sub>2</sub>. Define the hidden
representation h of size H such that φ<sub>1</sub>(M) = h, and M̂ as the
reconstruction given by φ<sub>2</sub>(φ<sub>1</sub>(M)) = M̂. Define
φ<sub>3</sub> as the function transformation for the clustering module that
takes as input h and outputs cluster probabilities c: φ<sub>3</sub>(h) = c.
Fig. 1 illustrates the framework for the neural network.</p>
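      <p>As a concrete illustration, the following is a minimal PyTorch sketch of
this three-module layout. The paper provides no code; the module and variable
names, the illustrative values of K and C, and the use of plain dense layers in
place of the full Table 1 architecture are our simplifications.</p>
      <preformat>
# Sketch of the encoder/decoder/cluster-module layout (phi_1, phi_2, phi_3).
# Plain dense layers stand in for the full architecture of Table 1.
import torch
import torch.nn as nn

K = 10000  # number of columns (e.g., prescriptions); illustrative only
H = 128    # hidden size, as in the paper's experiments
C = 40     # cluster units (the paper uses 2x the desired cluster count)

encoder = nn.Sequential(nn.Linear(K, H), nn.LeakyReLU(),
                        nn.Linear(H, H), nn.LeakyReLU())     # phi_1
decoder = nn.Sequential(nn.Linear(H, H), nn.LeakyReLU(),
                        nn.Linear(H, K), nn.LeakyReLU())     # phi_2
cluster = nn.Sequential(nn.Linear(H, H), nn.LeakyReLU(),
                        nn.Linear(H, C), nn.Softmax(dim=1))  # phi_3

M = torch.log1p(torch.randint(0, 50, (128, K)).float())  # f(x) = log(1 + x)
h = encoder(M)       # hidden representation of size H
M_hat = decoder(h)   # reconstruction
c = cluster(h)       # soft cluster probabilities
</preformat>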
<p>For log-transformed sparse count data, cosine similarity is a useful distance
representation. However, computing the cosine similarity for all pairs may be
problematic when N is large: O(N²). We propose to learn a hidden
representation h = {h∠, h<sub>‖·‖</sub>} where the cosine similarity between examples in M is
approximately preserved in h∠, of size H − 1 units, and where the final unit
h<sub>‖·‖</sub> captures information in the norm. To achieve this, we define batch cosine
similarity (O(B²)) and its loss L<sub>h∠,O</sub> as a penalization added to the
autoencoder loss.</p>
<p>Define the batch size B such that h<sub>B</sub> and M̂<sub>B</sub> are of size
B × H and B × K for h and M̂ respectively. Let σ(x<sub>i</sub>, x<sub>j</sub>) be the pairwise cosine
similarity. Define σ<sub>B</sub>, the batch cosine similarity, i.e.,
σ<sub>B</sub>({x<sub>1</sub>, …, x<sub>B</sub>}) = [σ(x<sub>i</sub>, x<sub>j</sub>)]<sub>ij</sub> ∀ i, j ∈ {1, …, B}. Then,
L<sub>h∠,O</sub> = (1/B) ‖σ<sub>B</sub>(h∠<sub>B</sub>) − σ<sub>B</sub>(M̂<sub>B</sub>)‖<sub>2</sub><sup>2</sup>.</p>
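      <p>Continuing the sketch above, batch cosine similarity and this loss
translate directly; the split of the final hidden unit into h<sub>‖·‖</sub>
follows the text, while the function names are ours.</p>
      <preformat>
import torch.nn.functional as F

def batch_cosine(x):
    """sigma_B: all pairwise cosine similarities within a batch, O(B^2)."""
    x = F.normalize(x, dim=1)  # unit-normalize each row
    return x @ x.t()           # [sigma(x_i, x_j)]_ij, a B x B matrix

def loss_angle(h_ang, m_hat):
    """L_{h_ang,O}: match hidden-space angles to output-space angles."""
    B = h_ang.shape[0]
    return (batch_cosine(h_ang) - batch_cosine(m_hat)).pow(2).sum() / B

# h = {h_ang, h_norm}: the last hidden unit carries norm information
h_ang, h_norm = h[:, :-1], h[:, -1:]
</preformat>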
      <p>While L<sub>h∠,O</sub> encourages matching angles in h∠ and M̂, two
vectors in h∠ could have the same cosine angle but different embedding
locations. Note this could be problematic because the hidden vectors are used
to learn cluster membership probabilities c. To encourage approximate
injectivity, we introduce the loss L<sub>‖h∠‖=1</sub>, which penalizes hidden
representations away from the surface of the unit-norm hypersphere. We could
enforce this as a hard constraint; however, the ability to violate the
constraint may facilitate alignment of the embedded representation angle with
that of the output space.</p>
      <p>Table 1. Deep network architecture.
Layer | Index | Size | Act.
(Input) | 0 | B × K | –
Dense | – | B × H | LReLU
Haar wavelet | – | B × H | –
Expand, permute | – | B × H × P | –
Pool | 1 | B × H | –
Sum (0) and (1) | – | B × H | –
Dense | – | B × H | LReLU
Batch norm | – | B × H | –
Dense | – | B × H | LReLU
Batch norm | – | B × H | –
Dense | – | B × H | LReLU
Batch norm | – | B × H | –
(Hidden) | 2 | B × H | –
Dense | – | B × H | LReLU
Dense | – | B × H | LReLU
Dense | – | B × K | LReLU
(Output) | – | B × K | –
Copy (2) | – | B × H | –
Dense | – | B × H | LReLU
Dense | – | B × H | LReLU
Dense | – | B × C | Softmax
(Cluster) | – | B × C | –</p>
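      <p>The text does not give a closed form for this soft penalty; a squared
deviation from unit norm, as sketched below, is one plausible reading.</p>
      <preformat>
def loss_unit_norm(h_ang):
    """L_{||h_ang||=1}: softly penalize hidden codes off the unit
    hypersphere. The squared-deviation form is our assumption."""
    return (h_ang.norm(dim=1) - 1.0).pow(2).mean()
</preformat>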
<p>The hidden representation h∠ preserves angular distances of the output
space, and its unit norm and approximate injectivity are useful for our
clustering. Note that for a probability vector c that sums to 1, its
element-wise square root has unit norm, since Σ<sub>i</sub>(√c<sub>i</sub>)² = Σ<sub>i</sub>c<sub>i</sub> = 1, and so it can
be interpreted as an angle. Therefore, we can match the angular representation
of c<sup>1/2</sup> to that of h∠ with batch cosine similarity.</p>
<p>Define c<sub>B</sub> = φ<sub>3</sub>(h<sub>B</sub>), the set of probability vectors indicating cluster
membership. Note that the cosine similarity of c<sub>B</sub><sup>1/2</sup> is non-negative, and we are
not interested in incurring loss due to differences between 0 and negative
cosine similarities. We define
L<sub>c,h∠</sub> = (1/B) ‖σ<sub>B</sub>(c<sub>B</sub><sup>1/2</sup>) − max(σ<sub>B</sub>(h∠<sub>B</sub>), 0)‖<sub>2</sub><sup>2</sup>.</p>
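      <p>Continuing the sketch, this loss follows the definition directly,
using batch_cosine from above; the clamp implements max(σ<sub>B</sub>(h∠<sub>B</sub>), 0).</p>
      <preformat>
def loss_cluster(c, h_ang):
    """L_{c,h_ang}: match angles of sqrt-probability vectors to hidden
    angles, ignoring negative hidden-space similarities."""
    B = c.shape[0]
    sim_c = batch_cosine(c.sqrt())              # non-negative by construction
    sim_h = batch_cosine(h_ang).clamp(min=0.0)  # max(sigma_B(h_ang), 0)
    return (sim_c - sim_h).pow(2).sum() / B
</preformat>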
<p>We add two more loss terms to assist representation learning: h∠ spread and
c entropy. For level sets of the batch cosine loss, we prefer that embedded
vectors in h∠ be spread apart, to minimize coincidental overlap both for
suboptimally located members of h∠ and for future unseen points from the
underlying distribution. We also impose an entropy loss term that encourages
cluster probabilities to be spread across more than one cluster, promoting
exploration in cluster membership and helping avoid local optima. Thus, our
overall objective function is Σ<sub>i</sub> λ<sub>i</sub>L<sub>i</sub> for
L<sub>i</sub> ∈ {L<sub>M,M̂</sub>, L<sub>‖h∠‖=1</sub>, L<sub>h∠,O</sub>, L<sub>c,h∠</sub>, L<sub>spread</sub>, L<sub>entropy</sub>}, with weights
λ<sub>i</sub> ∈ {1, 10⁻¹, 10⁻¹, 1, 10⁻⁴, 10⁻⁴}, respectively.</p>
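      <p>Putting the pieces together with the stated weights: the exact forms of
L<sub>M,M̂</sub>, L<sub>spread</sub>, and L<sub>entropy</sub> are not spelled out in the text, so the choices
below (MSE reconstruction, mean squared off-diagonal similarity, and negative
entropy) are our reading, not the authors' exact formulation.</p>
      <preformat>
def total_loss(M, M_hat, h_ang, c):
    """Weighted sum of the six loss terms with the paper's weights."""
    recon = F.mse_loss(M_hat, M)                        # L_{M,M_hat} (assumed MSE)
    unit = loss_unit_norm(h_ang)                        # L_{||h_ang||=1}
    angle = loss_angle(h_ang, M_hat)                    # L_{h_ang,O}
    clust = loss_cluster(c, h_ang)                      # L_{c,h_ang}
    spread = batch_cosine(h_ang).triu(1).pow(2).mean()  # push embeddings apart
    entropy = (c * (c + 1e-8).log()).sum(dim=1).mean()  # minimized -> high entropy
    return (1.0 * recon + 0.1 * unit + 0.1 * angle
            + 1.0 * clust + 1e-4 * spread + 1e-4 * entropy)
</preformat>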
<p>The autoencoder framework we adopt is shown in Table 1. The
permute-pool layer [<xref ref-type="bibr" rid="ref3">3</xref>] copies the tensor P times, permutes the values, and
performs max-pooling over the P dimension with size 3 and stride 2. For our
experiments we set B = 128, P = 4, and H = 128. We set C to be twice the
desired number of centroids and, in post-processing, merge the identified
clusters based on maximum pairwise cluster cosine similarity.</p>
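      <p>A sketch of the permute-pool layer as we read this description follows;
the precise permutation scheme is specified in [<xref ref-type="bibr" rid="ref3">3</xref>], not here, and this
version simply permutes features independently within each copy.</p>
      <preformat>
def permute_pool(x, P=4):
    """Copy x (B x H) P times, permute values within each copy, then
    max-pool over the P copies with size 3 and stride 2 -> B x H."""
    copies = [x[:, torch.randperm(x.shape[1])] for _ in range(P)]
    stacked = torch.stack(copies, dim=2)                     # B x H x P
    pooled = F.max_pool1d(stacked, kernel_size=3, stride=2)  # pool over P
    return pooled.squeeze(2)                                 # B x H for P = 4
</preformat>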
    </sec>
    <sec id="sec-3">
      <title>Results</title>
<p>Evaluation on simulated data. We evaluate our method on real and
simulated data. First, on simulated data generated so as to be similar to our
target healthcare application, we show that our method typically finds a
clustering that, under several measures, is more similar to the ground-truth
labels from the simulation than competing methods are.</p>
      <p>Table 2. Clustering quality on simulated data: our method vs. k-means in
the original data space (OKM) and k-means in the hidden space (HKM).
Data | Rand: Ours, OKM, HKM | Jaccard: Ours, OKM, HKM | Recall: Ours, OKM, HKM | Precision: Ours, OKM, HKM
baseline | 1.00, 0.84, 1.00 | 1.00, 0.12, 0.99 | 1.00, 0.14, 1.00 | 1.00, 0.58, 1.00
samples=100 | 1.00, 0.85, 0.91 | 0.90, 0.11, 0.17 | 0.90, 0.12, 0.19 | 1.00, 0.60, 0.65
dims=100000 | 0.93, 0.21, 0.48 | 0.08, 0.04, 0.04 | 0.14, 0.04, 0.05 | 0.17, 0.87, 0.62
explode=1000000 | 1.00, 0.84, 1.00 | 1.00, 0.13, 1.00 | 1.00, 0.14, 1.00 | 1.00, 0.59, 1.00
centroids=100 | 0.99, 0.81, 1.00 | 0.47, 0.03, 0.95 | 0.47, 0.03, 0.96 | 1.00, 0.64, 0.99
sd=0.1 | 0.93, 0.28, 0.78 | 0.08, 0.04, 0.05 | 0.15, 0.04, 0.06 | 0.15, 0.79, 0.29</p>
<p>We generate data as follows. Centroids are sampled from a multivariate
normal N(0, 1) and normalized onto the unit ball. We generate an equal number
of samples for each centroid by adding noise distributed according to N(0, σ²I).
Then we translate these points away from the origin by multiplying each point
by an "explosion" factor drawn uniformly at random from [1, E]. Thus, the
simulation is parameterized by the number of centroids, the number of
dimensions, the explode factor E, the number of samples, and the standard
deviation σ of the noise. We set each parameter to a baseline level and vary
one at a time to test our algorithm.</p>
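      <p>A sketch of this generator follows; the parameter defaults here are
arbitrary placeholders, not the paper's baseline settings, which are not
reported.</p>
      <preformat>
import numpy as np

def simulate(n_centroids=10, dims=1000, n_per=50, sd=0.01, explode=10.0, seed=0):
    """Unit-ball centroids, Gaussian noise N(0, sd^2 I), and a per-point
    uniform "explosion" factor on [1, explode]."""
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(n_centroids, dims))
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # unit ball
    X, y = [], []
    for k, mu in enumerate(centroids):
        pts = mu + rng.normal(scale=sd, size=(n_per, dims))  # cluster noise
        pts *= rng.uniform(1.0, explode, size=(n_per, 1))    # explosion factor
        X.append(pts)
        y += [k] * n_per
    return np.vstack(X), np.array(y)
</preformat>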
<p>We cluster the data using our method, k-means in the original simulated
data space, and k-means in the hidden space. We evaluate the results of these
clustering methods using precision, recall, the Rand index, and the Jaccard
index, all relative to the ground-truth labels from the simulation. Results are
shown in Table 2. By these measures, our method almost always outperforms
k-means in the original space, and usually outperforms k-means in the hidden
space.</p>
      <p>Evaluation on real healthcare data. As a case study to demonstrate our
method's real-world significance, we used our method to bicluster healthcare
providers, prescriptions, and procedures based on the number of prescriptions
and procedures administered by each provider, and the number of providers
administering each prescription or procedure as recorded.</p>
<p>Specifically, we used the Medicare Provider Utilization and Payment Data: Part
D Prescriber Summary Table CY2015 [<xref ref-type="bibr" rid="ref1">1</xref>], which tabulates all prescriptions given
under the Medicare Part D program in 2015 in the United States. In the interest
of space, we only show results for a clustering of providers based on the
prescriptions they gave. Our physician authors assessed the quality of the
resulting clusters.</p>
<p>The clusterings formed by our method and by k-means, each with 20 clusters,
are shown in Table 3. Our method produces qualitatively better clusters than
k-means does. For example, our clustering consistently includes a larger
fraction of specialists in the specialist cluster, e.g., Dentist (85k), Psychiatry (21k),
Emergency Medicine (20k). Our clustering, unlike k-means, also identifies clean
obstetrics and hematology/oncology clusters. K-means identifies a cardiology and
interventional cardiology cluster that our method does not; however, our
clustering identified this cluster and merged it (Cardiology (18k), Nurse Prac (3k),
Internal Medicine (2k)) into the internal medicine subgroup in post-processing.</p>
<p>Our method also provides insight in regard to providers in ontology
specialties that are not as common. For example, our method's urology cluster
also includes a large fraction of the radiation oncologists in our dataset. On
detailed assessment of this cluster, we found that the urology medications
tamsulosin and finasteride were most commonly prescribed in this cluster, and
that radiation oncologists most commonly prescribed tamsulosin, followed by
hydrocodone/acetaminophen and dexamethasone, possibly for prevention and
treatment of radiation-therapy-related complications. Our clustering identified
radiation oncologists and urologists as being similar according to the drugs
they commonly prescribe, a finding that would not be identified through the use
of a standard ontology alone. Future work will involve exploration of
comparisons with other scalable clustering methods and of pertinence to
clinical and workforce questions, including nurse practitioner and physician
assistant implicit specialty characterization and opioid prescription
networks.</p>
      <p>Table 3. Clusterings of providers by our method and by k-means;
clusters (rows) with top specialties (number).
(a) Our Clustering (Top 3 Specialties)
Dentist (85k); Oral Surgery (3k); Podiatry (1k)
Internal Med (82k); Family Practice (82k); Nurse Prac (44k)
Psychiatry (21k); Nurse Prac (9k); Psychiatry &amp; Neurology (6k)
Emergency Medicine (21k); Orthopedic Surgery (15k); Phys Asst (15k)
Optometry (18k); Ophthalmology (17k); Student (&lt;1k)
Obstetrics/Gynecology (17k); Nurse Prac (2k); Phys Asst (&lt;1k)
Gastroenterology (12k); Nurse Prac (3k); Internal Med (3k)
Dermatology (10k); Phys Asst (3k); Nurse Prac (1k)
Neurology (10k); Nurse Prac (1k); Psychiatry &amp; Neurology (1k)
Urology (9k); Phys Asst (1k); Nurse Prac (1k)
Nurse Prac (8k); Phys Asst (6k); Emergency Medicine (6k)
Pulmonary Disease (7k); Allergy/Immunology (3k); Otolaryngology (2k)
Hematology/Oncology (6k); Nurse Prac (2k); Medical Oncology (2k)
Internal Med (5k); Emergency Medicine (3k); Nurse Prac (2k)
Phys Asst (4k); Nurse Prac (4k); Orthopedic Surgery (2k)
Dentist (3k); Emergency Medicine (2k); Phys Asst (2k)
Infectious Disease (3k); Obstetrics/Gynecology (2k); Nurse Prac (2k)
Pharmacist (2k); Nurse Prac (1k); Internal Med (1k)
Podiatry (2k); Nurse Prac (1k); Optometry (1k)
Physical Med/Rehab (2k); Podiatry (2k); Nurse Prac (2k)
(b) K-Means Clustering (Top 3 Specialties)
Dentist (52k); Nurse Prac (1k); Phys Asst (1k)
Nurse Prac (35k); Phys Asst (27k); Internal Med (27k)
Family Practice (23k); Internal Med (17k); Nurse Prac (10k)
Family Practice (22k); Internal Med (20k); Nurse Prac (4k)
Emergency Medicine (20k); Orthopedic Surgery (13k); Phys Asst (12k)
Dentist (19k); Oral Surgery (3k); Maxillofacial Surgery (1k)
Internal Med (16k); Nurse Prac (12k); Family Practice (9k)
Family Practice (16k); Nurse Prac (13k); Internal Med (11k)
Cardiology (14k); Nurse Prac (1k); Interventional Cardiology (1k)
Dentist (14k); Oral Surgery (&lt;1k); Infectious Disease (&lt;1k)
Optometry (12k); Ophthalmology (5k); Student (&lt;1k)
Ophthalmology (11k); Optometry (2k); Student (&lt;1k)
Internal Med (11k); Family Practice (10k); General Practice (1k)
Psychiatry (10k); Nurse Prac (3k); Psychiatry &amp; Neurology (1k)
Psychiatry (8k); Psychiatry &amp; Neurology (3k); Nurse Prac (3k)
Urology (8k); Phys Asst (1k); Nurse Prac (1k)
Neurology (7k); Nurse Prac (&lt;1k); Phys Asst (&lt;1k)
Pulmonary Disease (6k); Allergy/Immunology (2k); Otolaryngology (1k)
Neurology (3k); Nurse Prac (1k); Physical Med/Rehab (1k)
Rheumatology (2k); Physical Med/Rehab (2k); Nurse Prac (2k)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Medicare Provider Utilization and Payment Data: Part D Prescriber Summary Table CY2015</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Song, H.O., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility location. CVPR (2017)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Weiss, J.C.: Machine learning for clinical risk: wavelet reconstruction networks for marked point processes. Working paper (2018)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. ICML (2016)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces: simultaneous deep learning and clustering. ICML (2017)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>