 Bicluster phenotyping of healthcare providers,
procedures, and prescriptions with deep learning

          Nathanael Fillmore1 , Ansh K. Mehta2 , and Jeremy C. Weiss3
1
    Department of Veterans Affairs & Dana-Farber Cancer Institute, Boston, MA, USA
             2
               University of South Alabama Hospitals, Mobile, AL, USA
          3
            Carnegie Mellon University, Heinz College, Pittsburgh, PA, USA



        Abstract. We consider the task of biclustering healthcare providers
        (≈ 10^6), procedures (≈ 10^4), and drugs (≈ 10^3) based on 2015 counts
        of claims by provider. Many clustering methods struggle with the scale
        of these data and with sparse count data, for which the notion of
        “distance” is unclear. We use neural networks, which are frequently used
        for representation learning, to represent the sparse high-dimensional in-
        formation with a low-dimensional embedding. While previous work has
        performed standard clustering methods in the embedding space, we in-
        stead leverage a low-dimensional embedding with geometric constraints
        that enables simultaneous image recovery and cluster determination.


1     Introduction
    We consider the task of biclustering healthcare providers (≈ 10^6), proce-
dures (≈ 10^4), and drugs (≈ 10^3). Standard ontologies exist that group medical
providers, procedures, and prescriptions into categories, but these categories do
not always correspond to clusters that are relevant for every question of interest,
and some standard categories, like Internal Medicine, are quite broad. Instead,
we would like a tool that captures practice characteristics de-novo to compare
against collected measures and ontologies. This has several applications: we can
measure concordance of the existing hierarchies with those identified in data,
we can look at changes in practice patterns over time, we can investigate dif-
ferences in practice within and across nominal subgroups based on practice and
utilization, and we can investigate co-occurrence.
    Many clustering methods perform poorly on this task because of the scale
of the data and because, in sparse count data, the most appropriate notion of
“distance” is not entirely clear. Additionally, centroid-based clustering in our
sparse count data space is problematic because identifying an appropriate cen-
troid location is non-obvious. Using the (possibly weighted) average value of
members assumes an L2 space, which is problematic for high-dimensional, sparse
count data. Moreover, it is well known that distances tend to be similar in high
dimensions, and this problem is only made worse by sparse count data.
    In this work, we describe a method that learns a low-dimensional embed-
ding with geometric constraints that enables simultaneous image recovery and
cluster determination. We propose a generic architecture that simultaneously
(1) optimizes image recovery in an autoencoding framework, (2) creates a low-
dimensional embedded representation of the high-dimensional space, (3) assigns
examples to clusters in a soft clustering, and (4) optimizes the quality of this
clustering.
    Several authors have recently proposed methods that are conceptually similar to our contribution. For example, [5] also formulates a method that integrates a clustering module in a deep learning framework, as does [4]. However, both these methods are based on identifying centroids in the latent space, which is not appropriate for our data. Other related work includes methods to learn metric embeddings with neural networks, e.g., [2], but as these methods do not incorporate a clustering objective in the neural network, the learned embedding is not necessarily suitable for clustering.
    In our application, cosine similarity is appropriate, so we use it in our examples. Many healthcare datasets consist of sparse count data in high-dimensional spaces as a consequence of counting visits, billing codes, procedure codes, drug prescriptions, etc. Our tool is well-suited to data of this nature.

Fig. 1. Autoencoder with clustering module. Nodes are Input (I), Output (O), Hidden (H), and Cluster (C). H and O are constrained by batch cosine similarity. H (h◦) and C are constrained by batch cosine similarity.

2    Method
    Consider a sparse count matrix with N rows and K columns. Define M as
an N × K matrix with elements log-transformed using the function f(x) =
log(1+x). We are interested in clustering over rows (providers) and columns (pre-
scriptions and procedures). We will consider clustering across rows and columns
separately. For each case, we construct an autoencoder consisting of an encoder
Ψ1 and a decoder Ψ2 . Define the hidden representation h of size H such that
Ψ1 (M ) = h and M̂ as the reconstruction given by Ψ2 ◦ Ψ1 (M ) = M̂ . Define Ψ3
as the function transformation for the clustering module that takes as input h
and outputs cluster probabilities c: Ψ3 (h) = c. Fig. 1 illustrates the framework
for the neural network.
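    As a rough illustration of this structure, the following is a minimal PyTorch sketch of an encoder Ψ1, decoder Ψ2, and clustering module Ψ3; the layer widths, depths, and activations here are placeholders, and the architecture we actually adopt is given in Table 1.

        # Minimal sketch of the encoder (Psi_1), decoder (Psi_2), and clustering
        # module (Psi_3); layer sizes are placeholders (see Table 1).
        import torch
        import torch.nn as nn

        class BiclusterAutoencoder(nn.Module):
            def __init__(self, K, H, C):
                super().__init__()
                # Psi_1: encode K-dimensional log-count rows into H hidden units
                self.encoder = nn.Sequential(
                    nn.Linear(K, H), nn.LeakyReLU(), nn.Linear(H, H))
                # Psi_2: decode the hidden representation back to K dimensions
                self.decoder = nn.Sequential(
                    nn.Linear(H, H), nn.LeakyReLU(), nn.Linear(H, K))
                # Psi_3: map the hidden representation to soft cluster memberships
                self.cluster = nn.Sequential(nn.Linear(H, C), nn.Softmax(dim=-1))

            def forward(self, M):
                h = self.encoder(M)      # hidden representation h = Psi_1(M)
                M_hat = self.decoder(h)  # reconstruction M_hat = Psi_2(h)
                c = self.cluster(h)      # cluster probabilities c = Psi_3(h)
                return h, M_hat, c
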
    For log-transformed sparse count data, cosine similarity is a useful distance representation. However, computing the cosine similarity for all pairs may be problematic when N is large: O(N^2). We propose to learn a hidden representation h = {h◦, h||·||} where the cosine similarity between examples in M is approximately preserved in h◦, of size H − 1 units, and where the final unit h||·|| captures information in the norm. To achieve this, we define the batch cosine similarity (O(B^2)) and its loss L_{h◦,O} as a penalization added to the autoencoder loss.
    Define batch size B such that h_B and M̂_B, the corresponding batches of h and M̂, are of size B × H and B × K respectively. Let δ(x_i, x_j) be the pairwise cosine similarity. Define δ_B as the batch cosine similarity, i.e. δ_B({x_1, . . . , x_B}) = [δ(x_i, x_j)]_{ij} for all i, j ∈ {1, . . . , B}. Then L_{h◦,O} = (1/B) ||δ_B(h◦_B) − δ_B(M̂_B)||_2^2.
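    As a sketch, δ_B and the loss L_{h◦,O} translate directly into code (continuing the PyTorch example above; h_circ_B and M_hat_B denote the batch slices h◦_B and M̂_B):

        import torch
        import torch.nn.functional as F

        def batch_cosine_similarity(X, eps=1e-8):
            # delta_B: B x B matrix of pairwise cosine similarities between rows of X
            Xn = F.normalize(X, p=2, dim=1, eps=eps)
            return Xn @ Xn.t()

        def loss_h_output(h_circ_B, M_hat_B):
            # L_{h_circ, O}: squared difference of the two batch cosine-similarity
            # matrices, scaled by 1/B
            B = h_circ_B.shape[0]
            diff = batch_cosine_similarity(h_circ_B) - batch_cosine_similarity(M_hat_B)
            return diff.pow(2).sum() / B
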

    While L_{h◦,O} encourages matching angles in h◦ and M̂, two vectors in h◦ could have the same cosine angle but different embedding locations. Note this could be problematic because the hidden vectors are used to learn cluster membership probabilities c. To encourage approximate injectivity, we introduce the loss L_{||h◦||=1} that penalizes hidden representations away from the surface of the unit-norm hypersphere. We could enforce this as a hard constraint; however, the ability to violate the constraint may facilitate alignment of the embedded representation angle with that of the output space.
    The hidden representation h◦ preserves angular distances of the output space, and its unit norm and approximate injectivity are useful for our clustering. Note that for a probability vector that sums to 1, its element-wise square root vector has unit norm and can be interpreted as an angle. Therefore, we can match the angular representation of c^{1/2} to that of h◦ with batch cosine similarity.
    Define c_B = Ψ3(h_B), the set of probability vectors indicating cluster membership. Note that the cosine similarity of c_B^{1/2} is non-negative and we are not interested in incurring loss due to differences between 0 and negative cosine similarities. We define L_{c,h◦} = (1/B) ||δ_B(c_B^{1/2}) − max(δ_B(h◦_B), 0)||_2^2.

Layer            Index  Size       Act.
(Input)            –    B × K
Dense              0    B × H      LReLU
Haar wavelet       –    B × H      –
Expand, permute    –    B × H × P  –
Pool               1    B × H      –
Sum (0) and (1)    –    B × H      –
Batch norm         –    B × H      –
Dense              –    B × H      LReLU
Batch norm         –    B × H      –
Dense              –    B × H      LReLU
Batch norm         –    B × H      –
Dense              –    B × H      LReLU
Dense              –    B × H      LReLU
(Hidden)           2    B × H
Dense              –    B × H      LReLU
Dense              –    B × H      LReLU
Dense              –    B × K      LReLU
(Output)           –    B × K
Copy (2)           –    B × H      –
Dense              –    B × H      LReLU
Dense              –    B × H      LReLU
Dense              –    B × C      Softmax
(Cluster)          –    B × C

Table 1. Deep network architecture
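    Continuing the PyTorch sketch (and reusing batch_cosine_similarity from above), the unit-norm penalty L_{||h◦||=1} and the cluster-angle loss L_{c,h◦} might be written as follows; the quadratic form of the unit-norm penalty is an assumption, since its exact expression is left open.

        def loss_unit_norm(h_circ_B):
            # L_{||h_circ||=1}: penalize hidden vectors whose norm drifts from the
            # unit hypersphere (quadratic penalty; one plausible choice)
            return (h_circ_B.norm(p=2, dim=1) - 1.0).pow(2).mean()

        def loss_cluster_angle(c_B, h_circ_B):
            # L_{c, h_circ}: match angles of sqrt(c) to the non-negative part of
            # the pairwise cosine similarities of h_circ
            B = c_B.shape[0]
            sim_c = batch_cosine_similarity(c_B.sqrt())
            sim_h = batch_cosine_similarity(h_circ_B).clamp(min=0.0)
            return (sim_c - sim_h).pow(2).sum() / B
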
    We add two more loss terms to assist representation learning: h spread and c entropy. For level sets of the batch cosine loss, we prefer that embedded vectors in h be spread apart, to minimize coincidental overlap for suboptimally located members of h and for future unseen points from the underlying distribution. We also impose an entropy loss term that encourages cluster probabilities to be spread across more than one cluster, promoting exploration in cluster membership and avoiding local optima. Thus, our overall objective function is Σ_i λ_i L_i for L_i ∈ {L_{M,M̂}, L_{||h◦||=1}, L_{h◦,O}, L_{c,h◦}, L_spread, L_entropy}. We set λ_i respectively: λ_i ∈ {1, 10^−1, 10^−1, 1, 10^−4, 10^−4}.
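    For illustration, the weighted objective could be assembled as below, continuing the sketches above. The reconstruction, spread, and entropy terms are each written in one plausible form (mean squared error, mean pairwise cosine similarity, and negative entropy, respectively), since their exact expressions are not spelled out here.

        def total_loss(M_B, M_hat_B, h_circ_B, c_B):
            # sum_i lambda_i * L_i with lambda = (1, 1e-1, 1e-1, 1, 1e-4, 1e-4)
            l_rec = F.mse_loss(M_hat_B, M_B)                     # L_{M, M_hat} (assumed MSE)
            l_norm = loss_unit_norm(h_circ_B)                    # L_{||h_circ||=1}
            l_ang = loss_h_output(h_circ_B, M_hat_B)             # L_{h_circ, O}
            l_clu = loss_cluster_angle(c_B, h_circ_B)            # L_{c, h_circ}
            l_spread = batch_cosine_similarity(h_circ_B).mean()  # spread: discourage overlap
            l_ent = (c_B * c_B.clamp_min(1e-8).log()).sum(dim=1).mean()  # negative entropy
            return (1.0 * l_rec + 1e-1 * l_norm + 1e-1 * l_ang
                    + 1.0 * l_clu + 1e-4 * l_spread + 1e-4 * l_ent)
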
    The autoencoder framework we adopt is shown in Table 1. The permute-
pool layer [3] copies the tensor P times, permutes the values, and performs
max-pooling over the P dimension with size 3 and stride 2. For our experiments
we set B = 128, P = 4, and H = 128. We set C to be twice the desired number of centroids and, in post-processing, merge the identified clusters based on maximum pairwise cluster cosine similarity.
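    As an illustration, one reading of the permute-pool layer is sketched below; treating "permutes the values" as an independent random shuffle of each copy is our assumption.

        def permute_pool(x, P=4):
            # Copy a B x H tensor P times, permute values within each copy, and
            # max-pool over the P dimension with kernel size 3 and stride 2.
            B, H = x.shape
            copies = [x[:, torch.randperm(H, device=x.device)] for _ in range(P)]
            stacked = torch.stack(copies, dim=2)             # B x H x P
            pooled = F.max_pool1d(stacked.reshape(B * H, 1, P),
                                  kernel_size=3, stride=2)   # pool over the P axis
            return pooled.reshape(B, H, -1).squeeze(-1)      # B x H when P = 4
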

3    Results
    Evaluation on simulated data. We evaluate our method on real and simulated data. First, on simulated data generated so as to be similar to our target healthcare application, we show that our method typically finds a clustering that, under several measures, is more similar to the ground truth labels from the simulation than competing methods are.

                     Rand             Jaccard          Recall           Precision
 Data            Ours OKM  HKM   Ours OKM  HKM   Ours OKM  HKM   Ours OKM  HKM
 baseline        1.00 0.84 1.00  1.00 0.12 0.99  1.00 0.14 1.00  1.00 0.58 1.00
 samples=100     1.00 0.85 0.91  0.90 0.11 0.17  0.90 0.12 0.19  1.00 0.60 0.65
 dims=100000     0.93 0.21 0.48  0.08 0.04 0.04  0.14 0.04 0.05  0.17 0.87 0.62
 explode=1000000 1.00 0.84 1.00  1.00 0.13 1.00  1.00 0.14 1.00  1.00 0.59 1.00
 centroids=100   0.99 0.81 1.00  0.47 0.03 0.95  0.47 0.03 0.96  1.00 0.64 0.99
 sd=0.1          0.93 0.28 0.78  0.08 0.04 0.05  0.15 0.04 0.06  0.15 0.79 0.29

Table 2. Quantitative evaluation on simulated data, relative to the true cluster labels. Data are generated as described in the text, with baseline parameters set at centroids=25, samples=1000, dims=1000, sd=0.01, explode=10000, and varied in each dataset as indicated. Data are clustered using our method (Ours), k-means in the original input space (OKM), and k-means in the hidden, embedded space we construct (HKM). The best score is in boldface. Our method almost always outperforms k-means in the original space, and usually outperforms k-means in the hidden space.
    We generate data as follows. Centroids are sampled from a multivariate nor-
mal N (0, 1) and normalized onto the unit ball. We generate an equal number
of samples for each centroid by adding noise distributed according to N(0, σ^2 I). Then we translate these points away from the origin by multiplying each point by a factor drawn uniformly at random from [1, κ], where κ is the “explosion” parameter. Thus, the
simulation is parameterized by the number of centroids, the number of dimen-
sions, the explode factor, the number of samples, and the standard deviation of
the noise. We set each parameter to a baseline level and vary one at a time to
test our algorithm.
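    A NumPy sketch of this generator is given below; the default values mirror the baseline parameters in the caption of Table 2, and treating the samples parameter as the total number of points is an assumption.

        import numpy as np

        def simulate(centroids=25, samples=1000, dims=1000, sd=0.01,
                     explode=10000, seed=0):
            # Unit-norm centroids, Gaussian noise around each centroid, then a
            # random radial "explosion" factor drawn uniformly from [1, explode].
            rng = np.random.default_rng(seed)
            C = rng.standard_normal((centroids, dims))
            C /= np.linalg.norm(C, axis=1, keepdims=True)
            labels = np.repeat(np.arange(centroids), samples // centroids)
            X = C[labels] + sd * rng.standard_normal((labels.size, dims))
            kappa = rng.uniform(1.0, explode, size=(labels.size, 1))
            return X * kappa, labels
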
    We cluster the data using our method, k-means in the original simulated
data space, and k-means in the hidden space. We evaluate results of these clus-
tering methods using precision, recall, the Rand index, and the Jaccard index,
all relative to the ground truth labels from the simulation. Results are shown in
Table 2. By these measures, our method almost always outperforms k-means in
the original space, and usually outperforms k-means in the hidden space.
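    These scores can be computed, for example, from scikit-learn's pair confusion matrix; the pair-counting definitions of precision and recall below are one standard choice and are an assumption about the exact variant reported.

        from sklearn.metrics.cluster import pair_confusion_matrix

        def pair_scores(labels_true, labels_pred):
            # Pair-counting Rand, Jaccard, recall, and precision vs. ground truth
            (tn, fp), (fn, tp) = pair_confusion_matrix(labels_true, labels_pred)
            return {"Rand": (tp + tn) / (tp + tn + fp + fn),
                    "Jaccard": tp / (tp + fp + fn),
                    "Recall": tp / (tp + fn),
                    "Precision": tp / (tp + fp)}
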
Evaluation on real healthcare data. As a case study to demonstrate our
method’s real world significance, we used our method to bicluster healthcare
providers, prescriptions, and procedures based on the number of prescriptions and procedures administered by each provider, and the number of providers administering each prescription or procedure, as recorded in the claims data.
    Specifically, we used Medicare Provider Utilization and Payment Data: Part
D Prescriber Summary Table CY2015 [1], which tabulates all prescriptions given
under the Medicare Part D program in 2015 in the United States. In the interest
of space, we only show results for a clustering of providers based on the pre-
scriptions they gave. Our physician authors assessed the quality of the resulting
clusters.
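    For concreteness, a pandas sketch of the preprocessing is shown below; the file and column names are placeholders rather than the actual CMS field names.

        import numpy as np
        import pandas as pd

        # File and column names are placeholders for the CMS Part D summary file.
        claims = pd.read_csv("part_d_prescriber_summary_2015.csv")
        counts = claims.pivot_table(index="provider_id", columns="drug_name",
                                    values="claim_count", aggfunc="sum",
                                    fill_value=0)
        M = np.log1p(counts.to_numpy())  # f(x) = log(1 + x); providers x drugs
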
    The clusterings formed by our method and by k-means, each with 20 clusters,
are shown in Table 3. Our method produces qualitatively better clusters than
k-means does. For example, our clustering consistently includes a larger frac-
tion of specialists in the specialist cluster, e.g., Dentist (85k), Psychiatry (21k),
Emergency Medicine (20k). Our clustering, unlike k-means, also identifies clean
obstetrics and hematology/oncology clusters. K-means identifies a cardiology and interventional cardiology cluster that our final clustering does not show; however, our method did identify this cluster (Cardiology (18k), Nurse Prac (3k), Internal Medicine (2k)) and merged it into the internal medicine subgroup in post-processing.
(a) Our Clustering (Top 3 Specialties)
Dentist (85k); Oral Surgery (3k); Podiatry (1k)
Internal Med (82k); Family Practice (82k); Nurse Prac (44k)
Psychiatry (21k); Nurse Prac (9k); Psychiatry & Neurology (6k)
Emergency Medicine (21k); Orthopedic Surgery (15k); Phys Asst (15k)
Optometry (18k); Ophthalmology (17k); Student (<1k)
Obstetrics/Gynecology (17k); Nurse Prac (2k); Phys Asst (<1k)
Gastroenterology (12k); Nurse Prac (3k); Internal Med (3k)
Dermatology (10k); Phys Asst (3k); Nurse Prac (1k)
Neurology (10k); Nurse Prac (1k); Psychiatry & Neurology (1k)
Urology (9k); Phys Asst (1k); Nurse Prac (1k)
Nurse Prac (8k); Phys Asst (6k); Emergency Medicine (6k)
Pulmonary Disease (7k); Allergy/Immunology (3k); Otolaryngology (2k)
Hematology/Oncology (6k); Nurse Prac (2k); Medical Oncology (2k)
Internal Med (5k); Emergency Medicine (3k); Nurse Prac (2k)
Phys Asst (4k); Nurse Prac (4k); Orthopedic Surgery (2k)
Dentist (3k); Emergency Medicine (2k); Phys Asst (2k)
Infectious Disease (3k); Obstetrics/Gynecology (2k); Nurse Prac (2k)
Pharmacist (2k); Nurse Prac (1k); Internal Med (1k)
Podiatry (2k); Nurse Prac (1k); Optometry (1k)
Physical Med/Rehab (2k); Podiatry (2k); Nurse Prac (2k)

(b) K-Means Clustering (Top 3 Specialties)
Dentist (52k); Nurse Prac (1k); Phys Asst (1k)
Nurse Prac (35k); Phys Asst (27k); Internal Med (27k)
Family Practice (23k); Internal Med (17k); Nurse Prac (10k)
Family Practice (22k); Internal Med (20k); Nurse Prac (4k)
Emergency Medicine (20k); Orthopedic Surgery (13k); Phys Asst (12k)
Dentist (19k); Oral Surgery (3k); Maxillofacial Surgery (1k)
Internal Med (16k); Nurse Prac (12k); Family Practice (9k)
Family Practice (16k); Nurse Prac (13k); Internal Med (11k)
Cardiology (14k); Nurse Prac (1k); Interventional Cardiology (1k)
Dentist (14k); Oral Surgery (<1k); Infectious Disease (<1k)
Optometry (12k); Ophthalmology (5k); Student (<1k)
Ophthalmology (11k); Optometry (2k); Student (<1k)
Internal Med (11k); Family Practice (10k); General Practice (1k)
Psychiatry (10k); Nurse Prac (3k); Psychiatry & Neurology (1k)
Psychiatry (8k); Psychiatry & Neurology (3k); Nurse Prac (3k)
Urology (8k); Phys Asst (1k); Nurse Prac (1k)
Neurology (7k); Nurse Prac (<1k); Phys Asst (<1k)
Pulmonary Disease (6k); Allergy/Immunology (2k); Otolaryngology (1k)
Neurology (3k); Nurse Prac (1k); Physical Med/Rehab (1k)
Rheumatology (2k); Physical Med/Rehab (2k); Nurse Prac (2k)

Table 3. Provider clustering by (a) our method, (b) k-means; clusters (rows) with top specialties (number).

    Our method also provides insight in regard to providers in ontology specialties that are not as common. For example, our method's urology cluster also includes a large fraction of the radiation oncologists in our dataset. On detailed assessment of this cluster, we found that urology medications tamsulosin and finasteride were most commonly prescribed in this cluster and that radiation oncologists most commonly prescribed tamsulosin, followed by hydrocodone/acetaminophen and dexamethasone, possibly for prevention and treatment of radiation therapy related
complications. Our clustering identified radiation oncologists and urologists as
being similar according to the drugs they commonly prescribe, a finding that
would not be identified through the use of a standard ontology alone. Future
work will involve exploration of comparisons with other scalable clustering meth-
ods and of pertinence to clinical and workforce questions, including nurse prac-
titioner and physician assistant implicit specialty characterization and opioid
prescription networks.

References
1. Medicare Provider Utilization and Payment Data: Part D Prescriber Summary
   Table CY2015 (2017), accessed April 30, 2018.
   https://data.cms.gov/Medicare-Part-D/Medicare-Provider-Utilization-and-Payment-Data-Par/qywy-pajd
2. Song, H.O., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility
   location. CVPR (2017)
3. Weiss, J.C.: Machine learning for clinical risk: wavelet reconstruction networks for
   marked point processes. working paper (2018)
4. Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering anal-
   ysis. ICML (2016)
5. Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces:
   Simultaneous deep learning and clustering. ICML (2017)