Learning Patient Similarity Using Joint Distributed Embeddings of Treatment and Diagnoses

Christopher Ormandy
Department of Informatics, King's College London
christopher.ormandy@kcl.ac.uk

Zina M. Ibrahim
Department of Biostatistics & Health Informatics, King's College London
zina.ibrahim@kcl.ac.uk

Richard JB Dobson
Department of Biostatistics & Health Informatics, King's College London
richard.j.dobson@kcl.ac.uk

Abstract

We propose the use of vector-based word embedding models to learn a cross-conceptual representation of medical vocabulary. The learned model is dense and encodes useful knowledge from the training concepts. Applying the embedding to the concepts of diagnoses and medications, we show that they can be used to measure similarities among patient prescriptions, leading to the discovery of informative and intuitive relationships between patients.

1 Introduction

In simple word representation techniques such as the Ngram model [Brants et al., 2007], words are regarded as single atomic units, and no notion of similarity between words exists. Conversely, distributed word representations in vector space provide an explicit grouping of similar words to achieve high performance in Natural Language Processing tasks [Rumelhart et al., 1988]. Such embeddings rely on vector operations to represent learned word proximities or similarities [Mikolov et al., 2013] and have been used to efficiently learn high-quality word vectors from very large datasets (containing billions of words) with vocabularies containing millions of words [Collobert and Weston, 2008; Bengio and Usunier, 2011; Socher et al., 2011; Glorot et al., 2011; Turney and Pantel, 2010; Turney, 2013].

Recently, [Mikolov et al., 2013] introduced a neural network design that uses distributed word representations to capture interesting features such as linguistic regularities and patterns. The architecture, named the Skip-Gram model, is trained to find representations of a given (input) word that are useful in predicting its surrounding words in a sentence or a document. The vector representation used in the Skip-Gram model greatly increases the network's training efficiency, with the ability to train on 100 billion words on a single optimized machine [Mikolov et al., 2013].

Our idea lies in using a Skip-Gram model to learn a compact representation of patient features. Using an initial model with medications and diagnoses as features, we propose a scheme to embed top-level ICD 9 codes of patient prescriptions and diagnoses within the same continuous representational space. We then build a skip-gram representation using the chosen system to create a compact and continuous representation of patients, enabling: 1) efficient feature processing and 2) some degree of generalization in finding similarity between patients given their features. Using our model, we would be able to reach the natural conclusion that a patient diagnosed with Diabetes is similar to a patient receiving insulin treatment. This is a non-trivial exercise for a machine learning algorithm, as we understand that the two cases are to some degree the same abstract concept expressed across two different domains (diagnoses vs. treatment).

The paper is structured as follows. After a brief illustration of the required background in Section 2, we discuss our architecture in Section 3. In Section 4, we show the results of training the resulting neural network model on a large database of intensive care unit medical records. We conclude with ongoing work and future directions in Section 5.

2 Background

2.1 Vector-based Word Representation

A well-established approach for representing concepts to facilitate learning is the use of a fixed-dimension, real-valued vector to represent each word. Each entry of this vector corresponds to some feature in a hypothetical latent space, so the size of the vector is the dimension of the feature space used to represent a single word.

For example, to create a 5-dimensional representation of prescriptions such as Aspirin, Ibuprofen, and Insulin, we could decide on features such as "Heart problems," "Pain killer," "Kidney problems," "Critical importance medication" and "Preventative treatment." In this example, Aspirin would rank moderately for "Heart," quite highly for "Pain killer," relatively low for "Kidney problems," perhaps low to moderate for "Critical importance" and moderate to high on "Preventative." Normalizing the values (by assuming a vector length of unity) gives the vectors shown in Table 1.

Drug | Heart | Pain killer | Kidney Problem | Critical Importance | Preventative
Aspirin | 0.57 | 0.74 | 0.04 | 0.12 | 0.33
Insulin | 0.07 | 0.07 | 0.64 | 0.54 | 0.54
Ibuprofen | 0.1 | 0.99 | 0.05 | 0.05 | 0.05

Table 1: Example of normalized manual encoding

In practice, we do not specify the nature of each feature, but merely supply the number of them; a neural network or another approach then learns these features so as to serve its needs best. However, the basic premise is the same: each feature has some meaning in the hypothetical latent space learned by the network, and so similar values in the same position indicate that two samples share some aspect of that feature. Examples which share a large number of features are therefore closer than those which share only a few. This encoding is the mechanism by which similarity is explicitly represented, as Table 2 shows.

Drug Pair | Similarity
Aspirin - Insulin | 0.36
Aspirin - Ibuprofen | 0.82

Table 2: Example of embedding similarity
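The values in Table 2 are plain cosine similarities between the manually encoded vectors of Table 1. As a minimal sketch (the feature values below are simply copied from Table 1 and are illustrative, not a real encoding), they can be reproduced as follows:

```python
import numpy as np

# Manually encoded drug vectors from Table 1
# (features: heart, pain killer, kidney, critical importance, preventative).
drugs = {
    "Aspirin":   np.array([0.57, 0.74, 0.04, 0.12, 0.33]),
    "Insulin":   np.array([0.07, 0.07, 0.64, 0.54, 0.54]),
    "Ibuprofen": np.array([0.10, 0.99, 0.05, 0.05, 0.05]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(drugs["Aspirin"], drugs["Insulin"]), 2))    # ~0.36, as in Table 2
print(round(cosine(drugs["Aspirin"], drugs["Ibuprofen"]), 2))  # ~0.82, as in Table 2
```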
2.2 Learning via the Skip-Gram Model

The Skip-Gram model is based on the goal of finding word representations that enable the prediction of the surrounding words of a given word in a sentence. The idea is that, for any 'candidate' word found in the training vocabulary, we can associate the most likely 'context' word, such that the two words show the maximum association.

Formally, given a sequence of words w_1, w_2, ..., w_n, ..., w_N, the Skip-Gram model trains a multi-class logistic regression so that for each candidate word w_n we can find a 'context' word w_{n+j}, falling within a window of c words before or after w_n, such that the probability P(w_{n+j} | w_n) is maximised [Mikolov et al., 2013]. In other words, the Skip-Gram model aims to maximize the average log probability:

(1/N) Σ_{n=1}^{N} Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{n+j} | w_n)

Here c is the size of the training context and is used to adjust the model. Larger c values associate a wider context with a given candidate word, implying more training examples and slower training but better classification.
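To make the objective concrete, the following sketch enumerates the (candidate, context) pairs over which the sum above runs, for a toy word sequence and window size c = 2; the sequence itself is purely illustrative and not taken from our data:

```python
def skipgram_pairs(words, c=2):
    """Enumerate (candidate, context) training pairs within a window of c words."""
    pairs = []
    for n, candidate in enumerate(words):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= n + j < len(words):
                pairs.append((candidate, words[n + j]))
    return pairs

# Toy sequence, purely illustrative.
print(skipgram_pairs(["the", "patient", "received", "insulin"], c=2))
# [('the', 'patient'), ('the', 'received'), ('patient', 'the'), ...]
```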
3 Our Work: A Patient-Focused Skip-Gram Model

The work performed here is based on the idea of generalising vector-based embeddings to any number of medical concepts, regardless of whether or not they come from the same underlying distribution. The main implication of this is that the features potentially become more general, or invisible to us. However, for related domains such as diseases and drugs, we could imagine a normalized encoding like the one given in Table 1, with features spanning the two concepts of disease and medication.

3.1 The Skip-Gram Model

The details of our implementation are largely based on the skip-gram model [Mikolov et al., 2013; Rumelhart et al., 1988] and are shown in Figure 1.

Figure 1: The Skip-Gram Architecture Used

The implemented logistic regression classifier receives as input an ID corresponding to an item in our vocabulary (in this case a list of all the ICD 9 codes for diagnoses and the medications). This ID corresponds to the drug embedding, which is a row within our embedding matrix. Using the embedding_lookup(...) functionality in tensorflow, we retrieve the 100-dimensional embedding for the input, multiply it by a weight matrix and pass the result through a softmax function. The input-output pairs are created as all permutations of items that appear together in the same set. For example, if a patient was prescribed medications A and B and given diagnosis D, we create the input-output pairs (A, B), (A, D), (B, A), (B, D), (D, A) and (D, B). We aggregate all these input-output pairs across all patients in the training set and use them to perform mini-batch back-propagation on the embedding matrix and the logistic regression parameters simultaneously. As proposed by Mikolov et al. [Mikolov et al., 2013], we use Noise Contrastive Estimation to approximate the loss at each step of training, to improve the efficiency of computation, and we built our model in tensorflow.
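The pair-generation step just described can be sketched in a few lines; the patient set below ({A, B, D}) is hypothetical and simply mirrors the example in the text:

```python
from itertools import permutations

def training_pairs(concepts):
    """All ordered pairs of concepts that co-occur in the same set."""
    return list(permutations(concepts, 2))

# Hypothetical patient: medications A and B, diagnosis D.
print(training_pairs(["A", "B", "D"]))
# [('A', 'B'), ('A', 'D'), ('B', 'A'), ('B', 'D'), ('D', 'A'), ('D', 'B')]
```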
3.2 Patient Similarity Using Unsupervised Embeddings

Using the unsupervised joint embeddings, we show that meaningful patient similarities can be discovered within the data. To do this, we train the prescription and diagnosis joint embeddings in the manner described in the previous section on a subsection of the data (100,000 prescriptions), and then draw patients randomly from the remaining portion of the data. We then aggregate all the prescriptions given to the patient on each day of their stay and replace each one with the relevant embedding trained previously.

To generate a treatment vector, we average all the individual drug embeddings for each day. By then taking a single day's treatment vector and computing the cosine similarity between it and other daily treatment vectors, we can find the similarity between patient treatments.

This is a well-established trick in NLP and is often a primary benchmark against which other methods are compared; while it may not seem like the most sophisticated solution, it can be surprisingly effective.
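A minimal sketch of this averaging-and-comparison step is shown below; the embedding dimensionality (100) matches the paper, but the lookup table, the drug names, and the contents of each day are hypothetical placeholders:

```python
import numpy as np

EMBED_SIZE = 100
rng = np.random.default_rng(0)

# Hypothetical trained embeddings: one 100-dimensional vector per concept.
embeddings = {name: rng.normal(size=EMBED_SIZE)
              for name in ["Aspirin", "Insulin", "Morphine Sulfate"]}

def treatment_vector(day_drugs):
    """Average the embeddings of all drugs given on a single day."""
    return np.mean([embeddings[d] for d in day_drugs], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

day_a = treatment_vector(["Aspirin", "Insulin"])
day_b = treatment_vector(["Aspirin", "Morphine Sulfate"])
print(cosine(day_a, day_b))  # similarity between two daily treatment vectors
```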
4 Experiments & Results

4.1 Data Source & Preprocessing

The model was trained using the MIMIC dataset [Johnson et al., 2016]. This is a large Intensive Care Unit (ICU) dataset containing the records of over 40,000 patients in the ICU of the Beth Israel Deaconess Medical Center, Boston, Massachusetts, U.S.A. between 2005 and 2012. Prescriptions are registered alongside a unique and anonymized patient identifier, with a date range indicating the period over which they were to be administered.

To train a neural network on patient prescriptions, one must first extract the data and reshape it, which is a non-trivial task given the way the data is presented in MIMIC-III. As shown in Figure 2, prescriptions are primarily indicated by a combination of hospital admission id, start date, end date and drug.

Figure 2: Example format of prescriptions

4.2 Tensorflow Implementation

The first step was to aggregate all the drugs by day and hospital admission id, to compile a list of concepts used per day, as shown in Figure 3. Each day defines a context window for that patient, so if a patient received drugs A, B and C on a given day, the input-output pairs for the network are (A, B), (A, C), (B, A), (B, C), (C, A) and (C, B).

Figure 3: Aggregated daily prescriptions

Next, we assign each concept an arbitrary ID, with 0 reserved for an 'unknown' entry. This allows unseen concepts to be included after training time. Each ID maps to a row in a randomly initialized embedding matrix, which has dimensions (number of drugs x embedding size). This embedding matrix is then used as input to a logistic regression classifier, which performs a one-hot prediction for the output concept, of size (number of drugs,). This is displayed mathematically in Equations (1) and (2).

E = embedding_lookup(X)    (1)

ŷ = softmax(E · W + b)    (2)

This system is trained via back-propagation, and it simultaneously learns both the W and b parameters and the values of the embedding matrix. Once training is complete, the embedding matrix acts as a lookup dictionary: to get the representation for a particular drug, simply find the ID it maps to and extract that row from the embedding matrix. We used standard Adam as the optimization method and the negative log likelihood as the loss function.

In tensorflow, we initialized the parameters randomly, using a truncated normal distribution for the weights and uniform random values for the embeddings and biases. This is based upon conventional methodologies found to be most useful in a wide range of settings, as described in [LeCun et al., 2012].

Following [Mikolov et al., 2013], we use Noise Contrastive Estimation to improve the efficiency of the model. As the model has many outputs (one for each entry in the 'vocabulary'), computing the softmax at each stage is computationally expensive. Since most of the entries are in fact not relevant (we have many classes, but most should be 0, and we want only a single entry that is substantially non-zero), we can improve the computational efficiency by sampling the loss function rather than computing it exhaustively. There are two ways to achieve this in practice: a sampled softmax, which essentially computes a Monte Carlo estimate, and Noise Contrastive Estimation, which picks examples of the positive and negative classes and derives an estimate from those.
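The description above maps onto a short TensorFlow (1.x-style) graph. The sketch below is our reading of that description, not released code; the vocabulary size, batch handling, number of negative samples, and weight standard deviation are illustrative placeholders, while the 100-dimensional embedding, uniform/truncated-normal initialization, NCE loss, and Adam optimizer follow the text:

```python
import tensorflow as tf  # written against the TensorFlow 1.x API

VOCAB_SIZE = 5000   # hypothetical: number of distinct drugs + ICD 9 codes
EMBED_SIZE = 100    # embedding dimensionality used in the paper
NUM_SAMPLED = 64    # hypothetical number of negative samples for NCE

# Input concept IDs and target concept IDs (one pair per training example).
inputs = tf.placeholder(tf.int32, shape=[None])
labels = tf.placeholder(tf.int64, shape=[None, 1])

# Randomly initialized embedding matrix (uniform) and output-layer parameters.
embeddings = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))
weights = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                          stddev=0.1))  # illustrative scale
biases = tf.Variable(tf.random_uniform([VOCAB_SIZE], -1.0, 1.0))

# Equation (1): look up the embedding row for each input ID.
embed = tf.nn.embedding_lookup(embeddings, inputs)

# NCE approximation of the softmax loss over the full vocabulary (Equation (2)).
loss = tf.reduce_mean(tf.nn.nce_loss(weights=weights, biases=biases,
                                     labels=labels, inputs=embed,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=VOCAB_SIZE))
train_op = tf.train.AdamOptimizer().minimize(loss)
```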
4.3 Results

Evaluating Prescription Embeddings

As this is an unsupervised approach, quantitative evaluation of the results is difficult. To assess whether the neighbourhoods are correct, most previous work either appeals to experts to evaluate the quality or avoids this altogether and leaves the reader to judge for themselves [Mikolov et al., 2013].

To provide a qualitative evaluation of the results, we took the most frequently occurring drugs and found their nearest neighbours, using cosine similarity as the measure.

These nearest-neighbour relationships show some useful similarity between drugs. For example, we see salts and electrolytes naturally grouping together (e.g. Potassium Chloride and Magnesium Sulfate). Aspirin is close to two statins, drugs which lower cholesterol and alleviate the risk of heart attack and similar problems. Metoclopramide is used to treat acid reflux, a stomach complaint, and Ranitidine is used to reduce the amount of stomach acid produced.

We also see relationships between items that often appear together even if they are not direct replacements. For example, Amiodarone HCl is an antiarrhythmic drug, used to treat irregular heartbeats, and its nearest neighbour is D5W. D5W is a code for Dextrose 5% in water, which is essentially just a carrier for IV lines and similar methods of delivery. The two are near because it is common within the data to administer Amiodarone HCl as a solution with D5W.

Drug | Nearest Neighbour | 2nd Nearest Neighbour
Potassium Chloride | Magnesium Sulfate | Calcium Gluconate
Morphine Sulfate | Acetaminophen | Oxycodone-Acetaminophen
Docusate Sodium | Sodium Chloride 0.9% Flush | Acetaminophen
Calcium Gluconate | Potassium Chloride | Magnesium Sulfate
Aspirin | Simvastatin | Atorvastatin
Metoclopramide | Ranitidine | Nitroglycerin
Amiodarone HCl | D5W (EXCEL BAG) | Phenylephrine HCl
Heparin Sodium | Warfarin | Ibuprofen

Table 3: Nearest Neighbours for Drug Embeddings

Joint Embeddings

As with the prescription-only embeddings, proving that these encode useful information in a quantitative way is somewhat complicated. We follow the same approach as the previous section and provide some of the nearest neighbours for common entries in the data, and also, in the next section, show that these embeddings are useful for the task of finding patients with similar treatments, as a way of demonstrating that they encode relevant information.

As can be seen in Table 4, the joint embeddings encode the same relevant information seen in the results for the single embeddings, while also providing links between diagnosis codes and drugs. For broader-ranging drugs, such as painkillers, we see clustering that is not particularly associated with a single ICD9 code; for example, Bisacodyl is close to Docusate Sodium and Morphine Sulfate. This also shows another interesting artifact of the method: Docusate Sodium is not a painkiller, but is 'close' to Bisacodyl because they often appear together. Acetaminophen, Meperidine, and Morphine Sulfate form another cluster of pain relief medications which do not appear 'close' to a particular ICD9 diagnosis code.

We also see interesting clustering of ICD9 codes: 427, 428 and 414, for example, all represent heart problems. We further see cross-group clusters, which put Diabetes and Insulin close together, as well as Aspirin and heart disease.

Entry | Nearest Neighbour | 2nd Nearest Neighbour
Bisacodyl | Docusate Sodium | Morphine Sulfate
Calcium Gluconate | SW | D5W
Acetaminophen | Meperidine | Morphine Sulfate
Insulin | 250 (Diabetes mellitus) | Tamsulosin HCl
427 (Cardiac dysrhythmias) | 428 (Congestive heart failure) | 414 (Other forms of chronic ischemic heart disease)
276 (Disorders of fluid electrolyte) | 530 (Diseases of esophagus) | 790 (Nonspecific findings on examination of blood)
401 (Essential hypertension) | 746 (Other congenital anomalies of heart) | 272 (Disorders of lipoid metabolism)
Aspirin | Clopidogrel Bisulfate | 414 (Other forms of chronic ischemic heart disease)
Pantoprazole Sodium | Iso-Osmotic Sodium Chloride | Magnesium Sulfate
Morphine Sulfate | Acetaminophen | Oxycodone-Acetaminophen
Heparin | Guaifenesin | Senna
Lorazepam | Chlorhexidine Gluconate | Diphenhydramine HCl
Metoprolol | Cefazolin | Warfarin
250 (Diabetes mellitus) | Insulin | 414 (Other forms of chronic ischemic heart disease)

Table 4: Nearest Neighbours for Drug & Diagnosis Embeddings

Evaluating Patient Similarity

Finding patients who shared similar daily treatment vectors worked well for finding patients of similar types. Due to the nature of the ICU, many patients received a large number of drugs, and using embeddings rather than a one-hot style approach allows the meaningful entries to be more discriminative. We selected patients at random, picked a random day for each patient, and computed the cosine similarity between that daily treatment vector and all other daily treatment vectors for all patients. As expected, other days from the same patient's stay in the ICU rank very highly in many cases. However, even if we look only at other patients, we see meaningful groupings occurring. Some examples are included in Table 5. The similarities are cosine similarities of normalised vectors, and so they vary between 100% and -100%. A similarity of 100% means identical, while -100% indicates complete opposites. As such, we can show example pairs which are similar, and pairs which are not.

Patient 1 Diagnoses | Patient 2 Diagnoses | Treatment Similarity
Newborn | Newborn | 99%
Stab Wounds | Motor Vehicle Accident | 89%
Pneumonia | Hypoxia | 85%
Diarrhea | Fever | 85%
Urinary Tract Infection | Renal Failure | 78%
Encephalopathy | Congestive Heart Failure | -41%
Incarcerated Hernia | Sepsis | -40%
Newborn | Arterial Injury | -35%

Table 5: Similarity between selected daily treatment vectors

As shown in Figure 4, if we apply PCA to the 100-dimensional embeddings to reduce the dimensionality down to two principal components, we can visualise the 'closeness' of a small number of entries. The figure shows that the selected entries fall into 3 or 4 clusters. Diabetes is close to Insulin, while two neonatal drugs (denoted by NEO*) are close to the diagnosis ICD 9 code for electrolyte imbalance, a condition that is strongly associated with newborns in the dataset. The other cluster corresponds to pain medication (and Docusate Sodium, which is used to alleviate constipation, a common side effect of many pain medications). This cluster seems to split into two subclusters, potentially corresponding to differing uses of the different drugs, but this is likely an artifact of the dimensionality reduction.

Figure 4: PCA on selected embeddings
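Both qualitative checks in this section, the nearest-neighbour tables and the two-dimensional PCA projection, are straightforward to reproduce once the embedding matrix has been trained. The sketch below assumes a NumPy array embedding_matrix of shape (vocabulary size, 100); the array shown, its size, and the function names are hypothetical placeholders rather than part of our implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

def nearest_neighbours(embedding_matrix, query_id, k=2):
    """Return indices of the k rows nearest to query_id by cosine similarity."""
    unit = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf            # exclude the query itself
    return np.argsort(-sims)[:k]

def project_2d(embedding_matrix, selected_ids):
    """Reduce selected embeddings to two principal components for plotting."""
    return PCA(n_components=2).fit_transform(embedding_matrix[selected_ids])

# Usage with hypothetical data: a random 1000 x 100 embedding matrix.
embedding_matrix = np.random.default_rng(0).normal(size=(1000, 100))
print(nearest_neighbours(embedding_matrix, query_id=5))
print(project_2d(embedding_matrix, selected_ids=[1, 2, 3, 4]).shape)  # (4, 2)
```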
5 Conclusions and Ongoing Work

The idea and initial results presented here are part of our ongoing work on finding compact representations of patients from their electronic hospital records, to make use of the massive deluge of longitudinal data available and to understand the densest set of features representing a given patient. Our long-term goal is to complement endeavors in personalized medicine by devising patient similarity measures that can be used to supplement scientific inquiry and which can translate into actionable knowledge at the bedside. For instance, we would like to be able to answer questions such as: given that patient X is similar to patient Y, will X respond to treatment Z similarly to patient Y? Or, why has patient A responded differently than patient B to the same treatment?

6 Acknowledgements

The authors would like to acknowledge the support of the NIHR Biomedical Research Centre for Mental Health, the Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust and King's College London, and a joint infrastructure grant from Guy's and St Thomas' Charity and the Maudsley Charity, London, United Kingdom.

This research was also supported by researchers at the National Institute for Health Research University College London Hospitals Biomedical Research Centre, and by awards establishing the Farr Institute of Health Informatics Research at UCLPartners, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust.

References

[Bengio and Usunier, 2011] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2764–2770, 2011.

[Brants et al., 2007] T. Brants, C. Popat, F. Xu, J. Och, and J. Dean. Large language models in machine translation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In International Conference on Machine Learning (ICML), pages 160–167, 2008.

[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning (ICML), pages 513–520, 2011.

[Johnson et al., 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

[LeCun et al., 2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Rumelhart et al., 1988] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[Socher et al., 2011] Richard Socher, Cliff Lin, Andrew Ng, and Christopher Manning. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), 2011.

[Turney and Pantel, 2010] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.

[Turney, 2013] Peter Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. Transactions of the Association for Computational Linguistics (TACL), pages 353–366, 2013.