=Paper=
{{Paper
|id=Vol-1891/paper5
|storemode=property
|title=Learning Patient Similarity Using Joint Distributed Embeddings of Treatment and Diagnoses
|pdfUrl=https://ceur-ws.org/Vol-1891/paper5.pdf
|volume=Vol-1891
|authors=Christopher Ormandy,Zina Ibrahim,Richard J. B. Dobson
|dblpUrl=https://dblp.org/rec/conf/ijcai/OrmandyID17
}}
==Learning Patient Similarity Using Joint Distributed Embeddings of Treatment and Diagnoses==
Christopher Ormandy
Department of Informatics, King’s College London
christopher.ormandy@kcl.ac.uk

Zina M. Ibrahim
Department of Biostatistics & Health Informatics, King’s College London
zina.ibrahim@kcl.ac.uk

Richard J. B. Dobson
Department of Biostatistics & Health Informatics, King’s College London
richard.j.dobson@kcl.ac.uk
Abstract

We propose the use of vector-based word embedding models to learn a cross-conceptual representation of medical vocabulary. The learned model is dense and encodes useful knowledge from the training concepts. Applying the embedding to the concepts of diagnoses and medications, we show that the learned vectors can be used to measure similarities among patient prescriptions, leading to the discovery of informative and intuitive relationships between patients.

1 Introduction

In simple word representation techniques such as the N-gram model [Brants et al., 2007], words are regarded as single atomic units, and no notion of similarity between words exists. Conversely, distributed word representations in vector space provide an explicit grouping of similar words to achieve high performance in Natural Language Processing tasks [Rumelhart et al., 1988]. Such embeddings rely on vector operations to represent learned word proximities or similarities [Mikolov et al., 2013] and have been used to efficiently learn high-quality word vectors from very large datasets (containing billions of words) using a vocabulary containing millions of words [Collobert and Weston, 2008; Bengio and Usunier, 2011; Socher et al., 2011; Glorot et al., 2011; Turney and Pantel, 2010; Turney, 2013].

Recently, [Mikolov et al., 2013] introduced a neural network design that uses distributed word representations to capture interesting features such as linguistic regularities and patterns. The architecture, named the Skip-Gram model, is trained to find representations of a given (input) word that are useful in predicting its surrounding words in a sentence or a document. The vector representation used in the Skip-Gram model greatly increases the network's training efficiency, making it possible to train on 100 billion words on a single optimized machine [Mikolov et al., 2013].

Our idea lies in using a Skip-Gram model to learn a compact representation of patient features. Using an initial model with medications and diagnoses as features, we propose a scheme to embed patient prescriptions and top-level ICD-9 diagnosis codes within the same continuous representational space. We then build a skip-gram representation using the chosen system to create a compact and continuous representation of patients enabling: 1) efficient feature processing and 2) some degree of generalization in finding similarity between patients given their features. Using our model, we would be able to reach the natural conclusion that a patient diagnosed with diabetes is similar to a patient receiving insulin treatment. This is a non-trivial exercise for a machine learning algorithm, as we understand that the two cases are to some degree the same abstract concept expressed across two different domains (diagnoses vs. treatment).

The paper is structured as follows. After a brief illustration of the required background in Section 2, we discuss our architecture in Section 3. In Section 4, we show the results of training the resulting neural network model on a large database of intensive care unit medical records. We conclude with ongoing work and future directions in Section 5.

2 Background

2.1 Vector-based Word Representation

A well-established approach for representing concepts to facilitate learning is the use of a fixed-dimension, real-valued vector representing each word. Each entry of this vector corresponds to some feature in a hypothetical latent space, making the size of the vector the dimensionality of the feature space used to represent a single word.

For example, creating a 5-dimensional representation of prescriptions such as Aspirin, Ibuprofen, and Insulin, we could decide on features such as "Heart problems," "Pain killer," "Kidney problems," "Critical importance medication" and "Preventative treatment." In this example, Aspirin would rank moderately for "Heart," quite highly for "Pain killer," relatively lowly for "Kidney problems," perhaps low to moderately for "Critical importance" and moderately to high on "Preventative." Normalizing each drug vector (by assuming a vector length of unity) gives the vectors shown in Table 1.
Drug        Heart   Pain killer   Kidney problems   Critical importance   Preventative
Aspirin     0.57    0.74          0.04              0.12                  0.33
Insulin     0.07    0.07          0.64              0.54                  0.54
Ibuprofen   0.10    0.99          0.05              0.05                  0.05

Table 1: Example of normalized manual encoding
In practice, we do not specify the nature of each feature, but merely supply the number of them; a neural network or another approach then learns these features so as to serve its needs best. However, the basic premise is the same: each feature has some meaning in the hypothetical latent space learned by the network, and so similar values in the same position indicate that two samples share some aspect of this feature. Examples which share a large number of features are therefore closer than those which share only a few. This encoding is the mechanism by which similarity is explicitly represented, as Table 2 shows.

Drug Pair             Similarity
Aspirin - Insulin     0.36
Aspirin - Ibuprofen   0.82

Table 2: Example of embedding similarity
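To make the similarity computation behind Table 2 concrete, the short sketch below (ours, not part of the original paper; it assumes only numpy) computes the cosine similarity between the hand-crafted Table 1 vectors and reproduces the two scores listed in Table 2.

```python
import numpy as np

# Hand-crafted 5-dimensional drug vectors from Table 1
# (features: heart, pain killer, kidney problems, critical importance, preventative).
drugs = {
    "Aspirin":   np.array([0.57, 0.74, 0.04, 0.12, 0.33]),
    "Insulin":   np.array([0.07, 0.07, 0.64, 0.54, 0.54]),
    "Ibuprofen": np.array([0.10, 0.99, 0.05, 0.05, 0.05]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(drugs["Aspirin"], drugs["Insulin"]), 2))    # 0.36
print(round(cosine(drugs["Aspirin"], drugs["Ibuprofen"]), 2))  # 0.82
```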
2.2 Learning via the Skip-Gram Model

The Skip-Gram model is based on the goal of finding word representations that enable the prediction of the surrounding words of a given word in a sentence. The idea is that, for any 'candidate' word found in the training vocabulary, we can associate the most likely 'context' words, i.e. the words with which the candidate shows the maximum association.

Formally, given a sequence of words w_1, w_2, ..., w_n, ..., w_N, the Skip-Gram model trains a multi-class logistic regression so that for each candidate word w_n we can find a 'context word' w_{n+j} falling within the window of c words before or after w_n such that the probability P(w_{n+j} | w_n) is maximal [Mikolov et al., 2013]. In other words, the Skip-Gram model aims to maximize the average log probability:

\[ \frac{1}{N} \sum_{n=1}^{N} \; \sum_{-c \le j \le c,\; j \ne 0} \log P(w_{n+j} \mid w_n) \]

Here, c is the size of the training context and is used to adjust the model. Larger values of c associate a wider context with a given candidate word, implying more training examples and slower training but better classification.
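As a toy illustration of what the sum above ranges over (ours, not from the paper), the snippet below enumerates the (candidate, context) pairs produced by a symmetric window of size c around each position in a short token sequence.

```python
# Enumerate the (candidate, context) pairs used by the Skip-Gram objective
# for a symmetric window of size c around each position n.
def skipgram_pairs(tokens, c=2):
    pairs = []
    for n, w_n in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= n + j < len(tokens):
                pairs.append((w_n, tokens[n + j]))
    return pairs

print(skipgram_pairs(["the", "patient", "was", "given", "insulin"], c=1))
# [('the', 'patient'), ('patient', 'the'), ('patient', 'was'), ('was', 'patient'), ...]
```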
3 Our Work: A Patient-Focused Skip-Gram Model

The work performed here is based on the idea of generalising vector-based embeddings to any number of medical concepts, regardless of whether or not they come from the same underlying distribution. The main implication of this is that the features potentially become more general, or invisible to us. However, with related domains such as diseases and drugs, we could imagine a normalized encoding like the one given in Table 1, with features spanning the two concepts of disease and medication.

3.1 The Skip-Gram Model

The details of our implementation are largely based on the skip-gram model [Mikolov et al., 2013; Rumelhart et al., 1988], and the architecture is shown in Figure 1. The implemented logistic regression classifier receives as input an ID corresponding to an item in our vocabulary (in this case a list of all the ICD-9 diagnosis codes and medications). This ID corresponds to the drug embedding, which is a row within our embedding matrix. Using the embedding_lookup(...) functionality in tensorflow, we retrieve the 100-dimensional embedding for the input, multiply it by a weight vector and pass it through a softmax function. The input-output pairs are created as all permutations of pairs that appear together in the same set. For example, if a patient was prescribed medications A and B and received diagnosis D, we create the input-output pairs (A, B), (A, D), (B, A), (B, D), (D, A) and (D, B). We aggregate all these input-output pairs across all patients in the training set and use them to perform mini-batch back-propagation on the embedding matrix and logistic regression parameters simultaneously. As proposed by Mikolov et al. [Mikolov et al., 2013], we use Noise Contrastive Estimation to approximate the loss at each step of training to improve the efficiency of computation, and we built our model in tensorflow.

Figure 1: The Skip-Gram Architecture Used
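A minimal sketch of this pair-generation step (ours; the concept names are illustrative and the per-day grouping is assumed to come from the preprocessing described in Section 4): every ordered pair of distinct concepts recorded for the same patient-day becomes one training example.

```python
from itertools import permutations

# Concepts (drug names and top-level ICD-9 codes) recorded for one patient on one day.
daily_concepts = ["Insulin", "Metformin", "250"]   # "250" = Diabetes mellitus

# Every ordered pair of distinct concepts in the set becomes an (input, target) example.
training_pairs = list(permutations(daily_concepts, 2))
print(training_pairs)
# [('Insulin', 'Metformin'), ('Insulin', '250'), ('Metformin', 'Insulin'),
#  ('Metformin', '250'), ('250', 'Insulin'), ('250', 'Metformin')]
```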
3.2 Patient Similarity Using Unsupervised Embeddings

Using the unsupervised joint embeddings, we show that meaningful patient similarities can be discovered within the data. To do this, we train the prescription and diagnosis joint embeddings in the manner described in the previous section on a subsection of the data (100,000 prescriptions), and then draw patients randomly from the remaining portion of the data. We then aggregate all the prescriptions given to the patient on a daily basis during their stay and replace each one with the relevant embedding trained previously.

To generate a treatment vector, we average all the individual drug embeddings for each day. By then taking a single day's treatment vector and computing the cosine similarity between it and other daily treatment vectors, we can find the similarity between patient treatments; a minimal sketch is given below.

This is a well-established trick in NLP and is often a primary benchmark to compare other methods against; while it may not seem like the most sophisticated solution, it can be surprisingly effective.
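The following sketch shows the daily averaging and cosine comparison under our assumptions (a trained embedding matrix indexed by concept ID and per-day lists of concept IDs; the variable names and the random stand-in matrix are ours, not the paper's).

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(5000, 100))   # stand-in for the trained 100-d embeddings

def daily_treatment_vector(concept_ids, embeddings):
    # Average the embeddings of every concept recorded for the patient that day.
    return embeddings[concept_ids].mean(axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

day_a = daily_treatment_vector([12, 87, 403], embedding_matrix)    # patient 1, one day
day_b = daily_treatment_vector([12, 90, 1150], embedding_matrix)   # patient 2, one day
print(cosine(day_a, day_b))
```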
4 Experiments & Results

4.1 Data Source & Preprocessing

The model was trained using the MIMIC-III dataset [Johnson et al., 2016]. This is a large Intensive Care Unit (ICU) dataset containing the records of over 40,000 patients in the ICU of the Beth Israel Deaconess Medical Center, Boston, Massachusetts, U.S.A. between 2005 and 2012.
Prescriptions are registered alongside a unique, anonymized patient identifier, with a date range indicating the period over which each prescription was to be administered.

To train a neural network on patient prescriptions, one must first extract the data and reshape it, which is a non-trivial task given the way the data is presented in MIMIC-III. As shown in Figure 2, prescriptions are primarily indicated by a combination of hospital admission id, start date, end date and drug.

Figure 2: Example format of prescriptions

4.2 Tensorflow Implementation

The first step was to aggregate all the drugs by day and hospital admission id, to compile a list of concepts used per day, as shown in Figure 3. Each day defines a context window for that patient, so if a patient received drugs A, B and C on a given day, the input-output pairs for the network are (A, B), (A, C), (B, A), (B, C), (C, A) and (C, B). A sketch of this aggregation step is given below.

Figure 3: Aggregated daily prescriptions
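A sketch of this aggregation in pandas (ours; it assumes the MIMIC-III PRESCRIPTIONS table with HADM_ID, STARTDATE and DRUG columns and, for brevity, keys each prescription to its start date rather than expanding the full start-end range):

```python
import pandas as pd

# Load the PRESCRIPTIONS table; column names are assumed to match the MIMIC-III CSVs.
prescriptions = pd.read_csv("PRESCRIPTIONS.csv", parse_dates=["STARTDATE"])

# Group drugs by hospital admission and calendar day to form one concept set per day.
daily_sets = (
    prescriptions
    .assign(DAY=prescriptions["STARTDATE"].dt.date)
    .groupby(["HADM_ID", "DAY"])["DRUG"]
    .apply(lambda drugs: sorted(set(drugs)))
    .reset_index(name="CONCEPTS")
)
print(daily_sets.head())
```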
Next, we assign each concept an arbitrary ID, with 0 reserved for an 'unknown' entry. This allows unseen concepts to be included after training time. Each ID maps to a row in a randomly initialized embedding matrix, which has dimensions (number of drugs x embedding size). This embedding matrix is then used as input to a logistic regression classifier, which performs a one-hot prediction for the output concept, with size (number of drugs,). This is expressed mathematically in Equations (1) and (2):

E = embedding_lookup(X)      (1)

ŷ = softmax(E · W + b)       (2)

This system is trained via back-propagation, and it simultaneously learns both the W and b parameters and the values of the embedding matrix. Once training is complete, the embedding matrix acts as a lookup dictionary: to get the representation for a particular drug, simply find the ID it maps to and extract that row from the embedding matrix. We used standard Adam as the optimization method and the negative log likelihood as the loss function.

In tensorflow, we initialized the parameters randomly, with a truncated normal distribution for the weights and a random uniform distribution for the embeddings and biases. This is based upon conventional methodologies found to be most useful in a wide range of settings, as described in [LeCun et al., 2012].

Following [Mikolov et al., 2013], we use Noise Contrastive Estimation to improve the efficiency of the model. As the model has many outputs (one for each entry in the 'vocabulary'), computing the softmax at each stage is computationally expensive. Since most of the entries are in fact not relevant (we have many classes, but most should be 0, and we want only a single entry that is substantially non-zero), we can improve the computational efficiency by sampling the loss function rather than computing it exhaustively. There are two ways to achieve this in practice: one is a sampled softmax, which essentially computes a Monte Carlo estimate; the other is Noise Contrastive Estimation, which picks examples of the positive and negative classes to obtain an estimate.
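Putting the pieces together, the condensed TensorFlow 1.x-style sketch below is our reconstruction of the model described above, not the authors' code; the vocabulary size, batch handling and number of negative samples are illustrative.

```python
import tensorflow as tf  # written against the TensorFlow 1.x API

vocab_size, embed_dim, num_sampled = 5000, 100, 64

inputs = tf.placeholder(tf.int32, shape=[None])      # IDs of input concepts (drugs / ICD-9 codes)
labels = tf.placeholder(tf.int64, shape=[None, 1])   # IDs of co-occurring concepts

# Embedding matrix (row 0 reserved for 'unknown'), uniformly initialised.
embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
embedded = tf.nn.embedding_lookup(embeddings, inputs)  # E = embedding_lookup(X)

# Output-layer parameters: truncated-normal weights, uniformly initialised biases.
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases = tf.Variable(tf.random_uniform([vocab_size], -0.1, 0.1))

# Noise Contrastive Estimation approximates the full softmax loss at each step.
loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=nce_weights, biases=nce_biases,
    labels=labels, inputs=embedded,
    num_sampled=num_sampled, num_classes=vocab_size))

train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(train_op, feed_dict={inputs: batch_inputs, labels: batch_labels})
```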
4.3 Results

Evaluating Prescription Embeddings

As this is an unsupervised approach, quantitative evaluation of the results is difficult. To assess whether the neighbourhoods are correct, most previous work either appeals to experts to evaluate the quality or avoids this altogether and leaves the reader to judge for themselves [Mikolov et al., 2013].

To provide a qualitative evaluation of the results, we took the top occurring drugs and found their nearest neighbours using cosine similarity as the measure.

These nearest neighbour relationships show some useful similarity between drugs. For example, we see salts and electrolytes naturally grouping together (e.g. Potassium Chloride and Magnesium Sulfate). Aspirin is close to two statins, drugs prescribed to reduce the risk of heart attack and similar cardiovascular problems. Metoclopramide is used to treat acid reflux, a stomach complaint, and Ranitidine is used to reduce the amount of stomach acid produced.
Drug                  Nearest Neighbour            2nd Nearest Neighbour
Potassium Chloride    Magnesium Sulfate            Calcium Gluconate
Morphine Sulfate      Acetaminophen                Oxycodone-Acetaminophen
Docusate Sodium       Sodium Chloride 0.9% Flush   Acetaminophen
Calcium Gluconate     Potassium Chloride           Magnesium Sulfate
Aspirin               Simvastatin                  Atorvastatin
Metoclopramide        Ranitidine                   Nitroglycerin
Amiodarone HCl        D5W (EXCEL BAG)              Phenylephrine HCl
Heparin Sodium        Warfarin                     Ibuprofen

Table 3: Nearest Neighbours for Drug Embeddings
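The nearest-neighbour lists in Table 3 can be reproduced with a few lines of linear algebra over the learned embedding matrix. The sketch below is ours (the variable names and the random stand-in matrix are assumptions): it L2-normalises the rows and ranks every other concept by cosine similarity to a query concept.

```python
import numpy as np

def nearest_neighbours(embeddings, id_to_name, query_id, k=2):
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf                       # exclude the query concept itself
    top = np.argsort(-sims)[:k]
    return [(id_to_name[i], float(sims[i])) for i in top]

# Example usage with a random stand-in matrix and a toy vocabulary.
rng = np.random.default_rng(0)
vocab = {0: "Aspirin", 1: "Simvastatin", 2: "Atorvastatin", 3: "Heparin Sodium"}
matrix = rng.normal(size=(len(vocab), 100))
print(nearest_neighbours(matrix, vocab, query_id=0))
```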
We also see relationships between items that often appear together even if they are not direct replacements. For example, Amiodarone HCl is an antiarrhythmic drug, used to treat irregular heartbeats, and its nearest neighbour is D5W. D5W is a code for Dextrose 5% in water, which is essentially just a carrier for IV lines and similar methods of delivery. The two are near each other because it is common within the data to administer Amiodarone HCl as a solution with D5W.

Joint Embeddings

As with the prescription-only embeddings, proving that these embeddings encode useful information in a quantitative way is somewhat complicated. We follow the same approach as the previous section and provide some of the nearest neighbours for common entries in the data; in the next section we also show that these embeddings are useful for the task of finding patients with similar treatments, as a way of demonstrating that they encode relevant information.

As can be seen in Table 4, the approach of using joint embeddings encodes the same relevant information seen in the results for single embeddings, while also providing links between diagnosis codes and drugs. For more broad-ranging drugs, such as painkillers, we see a clustering that is not particularly associated with a single ICD-9 code; for example, Bisacodyl is close to Docusate Sodium and Morphine Sulfate. This also shows another interesting artifact of this method: Docusate Sodium is not a painkiller, but is 'close' to Bisacodyl because they often appear together. Acetaminophen, Meperidine, and Morphine Sulfate are another cluster of pain relief medications which do not appear 'close' to a particular ICD-9 diagnosis code.

We also see interesting clustering of ICD-9 codes, with 427, 428 and 414 all representing heart problems, for example. We see cross-group clusters as well, which put Diabetes and Insulin close together, as well as Aspirin and heart disease.
Entry                                  Nearest Neighbour                           2nd Nearest Neighbour
Bisacodyl                              Docusate Sodium                             Morphine Sulfate
Calcium Gluconate                      SW                                          D5W
Acetaminophen                          Meperidine                                  Morphine Sulfate
Insulin                                250 (Diabetes mellitus)                     Tamsulosin HCl
427 (Cardiac dysrhythmias)             428 (Congestive heart failure)              414 (Other forms of chronic ischemic heart disease)
276 (Disorders of fluid electrolyte)   530 (Diseases of esophagus)                 790 (Nonspecific findings on examination of blood)
401 (Essential hypertension)           746 (Other congenital anomalies of heart)   272 (Disorders of lipoid metabolism)
Aspirin                                Clopidogrel Bisulfate                       414 (Other forms of chronic ischemic heart disease)
Pantoprazole Sodium                    Iso-Osmotic Sodium Chloride                 Magnesium Sulfate
Morphine Sulfate                       Acetaminophen                               Oxycodone-Acetaminophen
Heparin                                Guaifenesin                                 Senna
Lorazepam                              Chlorhexidine Gluconate                     Diphenhydramine HCl
Metoprolol                             Cefazolin                                   Warfarin
250 (Diabetes mellitus)                Insulin                                     414 (Other forms of chronic ischemic heart disease)

Table 4: Nearest Neighbours for Drug & Diagnosis Embeddings

Evaluating Patient Similarity

Finding patients who share similar daily treatment vectors worked well for finding patients of similar types. Due to the nature of the ICU, many patients received a large number of drugs, and using embeddings rather than a one-hot style approach allows the meaningful entries to be more discriminative. We selected patients at random, then picked a random day for each patient and computed the cosine similarity between that daily treatment vector and all other daily treatment vectors for all patients. As expected, other days from that patient's stay in the ICU rank very highly in many cases. However, even if we look only at other patients, we see meaningful groupings occurring. Some examples are included in Table 5. Similarities are cosine similarities of normalised vectors, and so they vary between 100% and -100%. A similarity of 100% means identical, while -100% indicates complete opposites.
As such, we can show example pairs which are similar and pairs which are not.

Patient 1 Diagnoses       Patient 2 Diagnoses        Treatment Similarity
Newborn                   Newborn                    99%
Stab Wounds               Motor Vehicle Accident     89%
Pneumonia                 Hypoxia                    85%
Diarrhea                  Fever                      85%
Urinary Tract Infection   Renal Failure              78%
Encephalopathy            Congestive Heart Failure   -41%
Incarcerated Hernia       Sepsis                     -40%
Newborn                   Arterial Injury            -35%

Table 5: Similarity between selected daily treatment vectors

As shown in Figure 4, if we conduct PCA on the 100-dimensional embeddings to reduce the dimensionality down to two principal components, we can visualise the 'closeness' of a small number of entries. The figure shows that the selected entries fall into 3 or 4 clusters. Diabetes is close to Insulin, while two neonatal drugs (denoted by NEO*) are close to the diagnosis ICD-9 code for electrolyte imbalance, a condition that strongly associates with newborns in the dataset. The other cluster corresponds to pain medication (and Docusate Sodium, which is used to alleviate constipation, a common side effect of many pain medications). This cluster seems to split into two subclusters, potentially corresponding to differing uses of the different drugs, but this is likely an artifact of the dimensionality reduction.

Figure 4: PCA on selected embeddings
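A sketch of the projection behind Figure 4 (ours; scikit-learn and matplotlib are assumed, and the selected IDs, labels and the random stand-in matrix are placeholders rather than the authors' actual entries):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(5000, 100))   # stand-in for the trained 100-d embeddings
selected_ids = [12, 87, 250, 403, 1150]           # hypothetical IDs of the concepts to plot
names = ["Insulin", "250 (Diabetes)", "Aspirin", "Morphine Sulfate", "Docusate Sodium"]

# Project the selected 100-dimensional embeddings onto their first two principal components.
coords = PCA(n_components=2).fit_transform(embedding_matrix[selected_ids])

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, names):
    plt.annotate(name, (x, y))
plt.title("PCA projection of selected embeddings")
plt.show()
```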
5 Conclusions and Ongoing Work

The idea and initial results presented here are part of our ongoing work on finding compact representations of patients using their electronic hospital records, to make use of the massive deluge of longitudinal data available and to understand the densest set of features representing a given patient. Our long-term goal is to complement endeavors in personalized medicine by devising patient similarity measures that can be used to supplement scientific inquiry and that can translate into actionable knowledge at the bedside. For instance, we would like to be able to answer questions such as "Given that patient X is similar to patient Y, will X respond to treatment Z similarly to patient Y?" or "Why has patient A responded differently than patient B to the same treatment?"

6 Acknowledgements

The authors would like to acknowledge the support of the NIHR Biomedical Research Centre for Mental Health, the Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust and King's College London, and a joint infrastructure grant from Guy's and St Thomas' Charity and the Maudsley Charity, London, United Kingdom.

This research was also supported by researchers at the National Institute for Health Research University College London Hospitals Biomedical Research Centre, and by awards establishing the Farr Institute of Health Informatics Research at UCLPartners, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust.

References

[Bengio and Usunier, 2011] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2764–2770, 2011.

[Brants et al., 2007] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In International Conference on Machine Learning (ICML), pages 160–167, 2008.
[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning (ICML), pages 513–520, 2011.

[Johnson et al., 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

[LeCun et al., 2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Rumelhart et al., 1988] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[Socher et al., 2011] Richard Socher, Cliff Lin, Andrew Ng, and Christopher Manning. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), 2011.

[Turney and Pantel, 2010] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.

[Turney, 2013] Peter Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. Transactions of the Association for Computational Linguistics (TACL), pages 353–366, 2013.