Learning Patient Similarity Using Joint Distributed Embeddings of Treatment and Diagnoses

Christopher Ormandy
Department of Informatics, King's College London
christopher.ormandy@kcl.ac.uk

Zina M. Ibrahim
Department of Biostatistics & Health Informatics, King's College London
zina.ibrahim@kcl.ac.uk

Richard JB Dobson
Department of Biostatistics & Health Informatics, King's College London
richard.j.dobson@kcl.ac.uk

Abstract

We propose the use of vector-based word embedding models to learn a cross-conceptual representation of medical vocabulary. The learned model is dense and encodes useful knowledge from the training concepts. Applying the embedding to the concepts of diagnoses and medications, we show that they can be used to measure similarities among patient prescriptions, leading to the discovery of informative and intuitive relationships between patients.

1 Introduction

In simple word representation techniques such as the Ngram model [Brants et al., 2007], words are regarded as single atomic units, and no notion of similarity between words exists. Conversely, distributed word representations in vector space provide an explicit grouping of similar words to achieve high performance in Natural Language Processing tasks [Rumelhart et al., 1988]. Such embeddings rely on vector operations to represent learned word proximities or similarities [Mikolov et al., 2013] and have been used to efficiently learn high-quality word vectors from very large datasets (containing billions of words) with vocabularies containing millions of words [Collobert and Weston, 2008; Bengio and Usunier, 2011; Socher et al., 2011; Glorot et al., 2011; Turney and Pantel, 2010; Turney, 2013].

Recently, [Mikolov et al., 2013] introduced a neural network design that uses distributed word representations to capture interesting features such as linguistic regularities and patterns. The architecture, named the Skip-Gram model, is trained to find representations of a given (input) word that are useful in predicting its surrounding words in a sentence or a document. The vector representation used in the Skip-Gram model greatly increases the network's training efficiency, with the ability to train on 100 billion words on a single optimized machine [Mikolov et al., 2013].

Our idea lies in using a Skip-Gram model to learn a compact representation of patient features. Using an initial model with medications and diagnoses as features, we propose a scheme to embed top-level ICD 9 codes of patient prescriptions and diagnoses within the same continuous representational space. We then build a skip-gram representation using the chosen system to create a compact and continuous representation of patients, enabling: 1) efficient feature processing and 2) some degree of generalization in finding similarity between patients given their features. Using our model, we would be able to reach the natural conclusion that a patient diagnosed with Diabetes is similar to a patient receiving insulin treatment. This is a non-trivial exercise for a machine learning algorithm, as we understand that the two cases are to some degree the same abstract concept expressed across two different domains (diagnoses vs. treatment).

The paper is structured as follows. After a brief illustration of the required background in Section 2, we discuss our architecture in Section 3. In Section 4, we show the results of training the resulting neural network model on a large database of intensive care unit medical records. We conclude with ongoing work and future directions in Section 5.

2 Background

2.1 Vector-based Word Representation

A well-established approach for representing concepts to facilitate learning is the use of a fixed-dimension, real-valued vector to represent each word. Each entry of this vector corresponds to some feature in a hypothetical latent space, so the size of the vector is the dimension of the feature space used to represent a single word.

For example, to create a 5-dimensional representation of prescriptions such as Aspirin, Ibuprofen, and Insulin, we could decide on features such as "Heart problems," "Pain killer," "Kidney problems," "Critical importance medication" and "Preventative treatment." In this example, Aspirin would rank moderately for "Heart," quite highly for "Pain killer," relatively low for "Kidney problems," perhaps low to moderate for "Critical importance" and moderate to high on "Preventative." Normalizing the values (by assuming a vector length of unity) gives the vectors shown in Table 1.

Drug | Heart | Pain killer | Kidney Problem | Critical Importance | Preventative
Aspirin | 0.57 | 0.74 | 0.04 | 0.12 | 0.33
Insulin | 0.07 | 0.07 | 0.64 | 0.54 | 0.54
Ibuprofen | 0.1 | 0.99 | 0.05 | 0.05 | 0.05

Table 1: Example of normalized manual encoding

In practice, we do not specify the nature of each feature, but merely supply the number of them; a neural network or another approach then learns these features so as to serve its needs best. However, the basic premise is the same: each feature has some meaning in the hypothetical latent space learned by the network, and so similar values in the same position indicate that two samples share some aspect of that feature. Examples which share a large number of features are therefore closer than those which share only a few. This encoding is the mechanism by which similarity is explicitly represented, as Table 2 shows.

Drug Pair | Similarity
Aspirin - Insulin | 0.36
Aspirin - Ibuprofen | 0.82

Table 2: Example of embedding similarity
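The values in Table 2 are plain cosine similarities between the manually encoded vectors of Table 1. As a minimal sketch (the feature values below are simply copied from Table 1 and are illustrative, not a real encoding), they can be reproduced as follows:

```python
import numpy as np

# Manually encoded drug vectors from Table 1
# (features: heart, pain killer, kidney, critical importance, preventative).
drugs = {
    "Aspirin":   np.array([0.57, 0.74, 0.04, 0.12, 0.33]),
    "Insulin":   np.array([0.07, 0.07, 0.64, 0.54, 0.54]),
    "Ibuprofen": np.array([0.10, 0.99, 0.05, 0.05, 0.05]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(drugs["Aspirin"], drugs["Insulin"]), 2))    # ~0.36, as in Table 2
print(round(cosine(drugs["Aspirin"], drugs["Ibuprofen"]), 2))  # ~0.82, as in Table 2
```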
2.2 Learning via the Skip-Gram Model

The Skip-Gram model is based on the goal of finding word representations that enable the prediction of the surrounding words of a given word in a sentence. The idea is that, for any 'candidate' word found in the training vocabulary, we can associate the most likely 'context' word, such that the two words show the maximum association.

Formally, given a sequence of words w_1, w_2, ..., w_n, ..., w_N, the Skip-Gram model trains a multi-class logistic regression so that for each candidate word w_n we can find a 'context' word w_{n+j}, falling within a window of c words before or after w_n, such that the probability P(w_{n+j} | w_n) is maximised [Mikolov et al., 2013]. In other words, the Skip-Gram model aims to maximize the average log probability:

(1/N) Σ_{n=1}^{N} Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{n+j} | w_n)

Here c is the size of the training context and is used to adjust the model. Larger c values associate a wider context with a given candidate word, implying more training examples and slower training but better classification.
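To make the objective concrete, the following sketch enumerates the (candidate, context) pairs over which the sum above runs, for a toy word sequence and window size c = 2; the sequence itself is purely illustrative and not taken from our data:

```python
def skipgram_pairs(words, c=2):
    """Enumerate (candidate, context) training pairs within a window of c words."""
    pairs = []
    for n, candidate in enumerate(words):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= n + j < len(words):
                pairs.append((candidate, words[n + j]))
    return pairs

# Toy sequence, purely illustrative.
print(skipgram_pairs(["the", "patient", "received", "insulin"], c=2))
# [('the', 'patient'), ('the', 'received'), ('patient', 'the'), ...]
```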
3 Our Work: A Patient-Focused Skip-Gram Model

The work performed here is based on the idea of generalising vector-based embeddings to any number of medical concepts, regardless of whether or not they come from the same underlying distribution. The main implication of this is that the features potentially become more general, or invisible to us. However, for related domains such as diseases and drugs, we could imagine a normalized encoding like the one given in Table 1, with features spanning the two concepts of disease and medication.

3.1 The Skip-Gram Model

The details of our implementation are largely based on the skip-gram model [Mikolov et al., 2013; Rumelhart et al., 1988] and are shown in Figure 1.

Figure 1: The Skip-Gram Architecture Used

The implemented logistic regression classifier receives as input an ID corresponding to an item in our vocabulary (in this case a list of all the ICD 9 codes for diagnoses and the medications). This ID corresponds to the drug embedding, which is a row within our embedding matrix. Using the embedding_lookup(...) functionality in tensorflow, we retrieve the 100-dimensional embedding for the input, multiply it by a weight matrix and pass the result through a softmax function. The input-output pairs are created as all permutations of items that appear together in the same set. For example, if a patient was prescribed medications A and B and given diagnosis D, we create the input-output pairs (A, B), (A, D), (B, A), (B, D), (D, A) and (D, B). We aggregate all these input-output pairs across all patients in the training set and use them to perform mini-batch back-propagation on the embedding matrix and the logistic regression parameters simultaneously. As proposed by Mikolov et al. [Mikolov et al., 2013], we use Noise Contrastive Estimation to approximate the loss at each step of training, to improve the efficiency of computation, and we built our model in tensorflow.
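The pair-generation step just described can be sketched in a few lines; the patient set below ({A, B, D}) is hypothetical and simply mirrors the example in the text:

```python
from itertools import permutations

def training_pairs(concepts):
    """All ordered pairs of concepts that co-occur in the same set."""
    return list(permutations(concepts, 2))

# Hypothetical patient: medications A and B, diagnosis D.
print(training_pairs(["A", "B", "D"]))
# [('A', 'B'), ('A', 'D'), ('B', 'A'), ('B', 'D'), ('D', 'A'), ('D', 'B')]
```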
3.2 Patient Similarity Using Unsupervised Embeddings

Using the unsupervised joint embeddings, we show that meaningful patient similarities can be discovered within the data. To do this, we train the prescription and diagnosis joint embeddings in the manner described in the previous section on a subsection of the data (100,000 prescriptions), and then draw patients randomly from the remaining portion of the data. We then aggregate all the prescriptions given to the patient on each day of their stay and replace each one with the relevant embedding trained previously.

To generate a treatment vector, we average all the individual drug embeddings for each day. By then taking a single day's treatment vector and computing the cosine similarity between it and other daily treatment vectors, we can find the similarity between patient treatments.

This is a well-established trick in NLP and is often a primary benchmark against which other methods are compared; while it may not seem like the most sophisticated solution, it can be surprisingly effective.
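A minimal sketch of this averaging-and-comparison step is shown below; the embedding dimensionality (100) matches the paper, but the lookup table, the drug names, and the contents of each day are hypothetical placeholders:

```python
import numpy as np

EMBED_SIZE = 100
rng = np.random.default_rng(0)

# Hypothetical trained embeddings: one 100-dimensional vector per concept.
embeddings = {name: rng.normal(size=EMBED_SIZE)
              for name in ["Aspirin", "Insulin", "Morphine Sulfate"]}

def treatment_vector(day_drugs):
    """Average the embeddings of all drugs given on a single day."""
    return np.mean([embeddings[d] for d in day_drugs], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

day_a = treatment_vector(["Aspirin", "Insulin"])
day_b = treatment_vector(["Aspirin", "Morphine Sulfate"])
print(cosine(day_a, day_b))  # similarity between two daily treatment vectors
```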
4 Experiments & Results

4.1 Data Source & Preprocessing

The model was trained using the MIMIC dataset [Johnson et al., 2016]. This is a large Intensive Care Unit (ICU) dataset containing the records of over 40,000 patients in the ICU of the Beth Israel Deaconess Medical Center, Boston, Massachusetts, U.S.A. between 2005 and 2012. Prescriptions are registered alongside a unique and anonymized patient identifier, with a date range indicating the period over which they were to be administered.

To train a neural network on patient prescriptions, one must first extract the data and reshape it, which is a non-trivial task given the way the data is presented in MIMIC-III. As shown in Figure 2, prescriptions are primarily indicated by a combination of hospital admission id, start date, end date and drug.

Figure 2: Example format of prescriptions

4.2 Tensorflow Implementation

The first step was to aggregate all the drugs by day and hospital admission id, to compile a list of concepts used per day, as shown in Figure 3. Each day defines a context window for that patient, so if a patient received drugs A, B and C on a given day, the input-output pairs for the network are (A, B), (A, C), (B, A), (B, C), (C, A) and (C, B).

Figure 3: Aggregated daily prescriptions

Next, we assign each concept an arbitrary ID, with 0 reserved for an 'unknown' entry. This allows unseen concepts to be included after training time. Each ID maps to a row in a randomly initialized embedding matrix, which has dimensions (number of drugs x embedding size). This embedding matrix is then used as input to a logistic regression classifier, which performs a one-hot prediction for the output concept, of size (number of drugs,). This is displayed mathematically in Equations (1) and (2).

E = embedding_lookup(X)    (1)

ŷ = softmax(E · W + b)    (2)

This system is trained via back-propagation, and it simultaneously learns both the W and b parameters and the values of the embedding matrix. Once training is complete, the embedding matrix acts as a lookup dictionary: to get the representation for a particular drug, simply find the ID it maps to and extract that row from the embedding matrix. We used standard Adam as the optimization method and the negative log likelihood as the loss function.

In tensorflow, we initialized the parameters randomly, using a truncated normal distribution for the weights and uniform random values for the embeddings and biases. This is based upon conventional methodologies found to be most useful in a wide range of settings, as described in [LeCun et al., 2012].

Following [Mikolov et al., 2013], we use Noise Contrastive Estimation to improve the efficiency of the model. As the model has many outputs (one for each entry in the 'vocabulary'), computing the softmax at each stage is computationally expensive. Since most of the entries are in fact not relevant (we have many classes, but most should be 0, and we want only a single entry that is substantially non-zero), we can improve the computational efficiency by sampling the loss function rather than computing it exhaustively. There are two ways to achieve this in practice: a sampled softmax, which essentially computes a Monte Carlo estimate, and Noise Contrastive Estimation, which picks examples of the positive and negative classes and derives an estimate from those.
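The description above maps onto a short TensorFlow (1.x-style) graph. The sketch below is our reading of that description, not released code; the vocabulary size, batch handling, number of negative samples, and weight standard deviation are illustrative placeholders, while the 100-dimensional embedding, uniform/truncated-normal initialization, NCE loss, and Adam optimizer follow the text:

```python
import tensorflow as tf  # written against the TensorFlow 1.x API

VOCAB_SIZE = 5000   # hypothetical: number of distinct drugs + ICD 9 codes
EMBED_SIZE = 100    # embedding dimensionality used in the paper
NUM_SAMPLED = 64    # hypothetical number of negative samples for NCE

# Input concept IDs and target concept IDs (one pair per training example).
inputs = tf.placeholder(tf.int32, shape=[None])
labels = tf.placeholder(tf.int64, shape=[None, 1])

# Randomly initialized embedding matrix (uniform) and output-layer parameters.
embeddings = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))
weights = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                          stddev=0.1))  # illustrative scale
biases = tf.Variable(tf.random_uniform([VOCAB_SIZE], -1.0, 1.0))

# Equation (1): look up the embedding row for each input ID.
embed = tf.nn.embedding_lookup(embeddings, inputs)

# NCE approximation of the softmax loss over the full vocabulary (Equation (2)).
loss = tf.reduce_mean(tf.nn.nce_loss(weights=weights, biases=biases,
                                     labels=labels, inputs=embed,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=VOCAB_SIZE))
train_op = tf.train.AdamOptimizer().minimize(loss)
```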
4.3 Results

Evaluating Prescription Embeddings

As this is an unsupervised approach, quantitative evaluation of the results is difficult. To assess whether the neighbourhoods are correct, most previous work either appeals to experts to evaluate the quality or avoids this altogether and leaves the reader to judge for themselves [Mikolov et al., 2013].

To provide a qualitative evaluation of the results, we took the most frequently occurring drugs and found their nearest neighbours, using cosine similarity as the measure.

These nearest-neighbour relationships show some useful similarity between drugs. For example, we see salts and electrolytes naturally grouping together (e.g. Potassium Chloride and Magnesium Sulfate). Aspirin is close to two statins, drugs which lower cholesterol and alleviate the risk of heart attack and similar problems. Metoclopramide is used to treat acid reflux, a stomach complaint, and Ranitidine is used to reduce the amount of stomach acid produced.

We also see relationships between items that often appear together even if they are not direct replacements. For example, Amiodarone HCl is an antiarrhythmic drug, used to treat irregular heartbeats, and its nearest neighbour is D5W. D5W is a code for Dextrose 5% in water, which is essentially just a carrier for IV lines and similar methods of delivery. The two are near because it is common within the data to administer Amiodarone HCl as a solution with D5W.

Drug | Nearest Neighbour | 2nd Nearest Neighbour
Potassium Chloride | Magnesium Sulfate | Calcium Gluconate
Morphine Sulfate | Acetaminophen | Oxycodone-Acetaminophen
Docusate Sodium | Sodium Chloride 0.9% Flush | Acetaminophen
Calcium Gluconate | Potassium Chloride | Magnesium Sulfate
Aspirin | Simvastatin | Atorvastatin
Metoclopramide | Ranitidine | Nitroglycerin
Amiodarone HCl | D5W (EXCEL BAG) | Phenylephrine HCl
Heparin Sodium | Warfarin | Ibuprofen

Table 3: Nearest Neighbours for Drug Embeddings

Joint Embeddings

As with the prescription-only embeddings, proving that these encode useful information in a quantitative way is somewhat complicated. We follow the same approach as the previous section and provide some of the nearest neighbours for common entries in the data, and also, in the next section, show that these embeddings are useful for the task of finding patients with similar treatments, as a way of demonstrating that they encode relevant information.

As can be seen in Table 4, the joint embeddings encode the same relevant information seen in the results for the single embeddings, while also providing links between diagnosis codes and drugs. For broader-ranging drugs, such as painkillers, we see clustering that is not particularly associated with a single ICD9 code; for example, Bisacodyl is close to Docusate Sodium and Morphine Sulfate. This also shows another interesting artifact of the method: Docusate Sodium is not a painkiller, but is 'close' to Bisacodyl because they often appear together. Acetaminophen, Meperidine, and Morphine Sulfate form another cluster of pain relief medications which do not appear 'close' to a particular ICD9 diagnosis code.

We also see interesting clustering of ICD9 codes: 427, 428 and 414, for example, all represent heart problems. We further see cross-group clusters, which put Diabetes and Insulin close together, as well as Aspirin and heart disease.

Entry | Nearest Neighbour | 2nd Nearest Neighbour
Bisacodyl | Docusate Sodium | Morphine Sulfate
Calcium Gluconate | SW | D5W
Acetaminophen | Meperidine | Morphine Sulfate
Insulin | 250 (Diabetes mellitus) | Tamsulosin HCl
427 (Cardiac dysrhythmias) | 428 (Congestive heart failure) | 414 (Other forms of chronic ischemic heart disease)
276 (Disorders of fluid electrolyte) | 530 (Diseases of esophagus) | 790 (Nonspecific findings on examination of blood)
401 (Essential hypertension) | 746 (Other congenital anomalies of heart) | 272 (Disorders of lipoid metabolism)
Aspirin | Clopidogrel Bisulfate | 414 (Other forms of chronic ischemic heart disease)
Pantoprazole Sodium | Iso-Osmotic Sodium Chloride | Magnesium Sulfate
Morphine Sulfate | Acetaminophen | Oxycodone-Acetaminophen
Heparin | Guaifenesin | Senna
Lorazepam | Chlorhexidine Gluconate | Diphenhydramine HCl
Metoprolol | Cefazolin | Warfarin
250 (Diabetes mellitus) | Insulin | 414 (Other forms of chronic ischemic heart disease)

Table 4: Nearest Neighbours for Drug & Diagnosis Embeddings

Evaluating Patient Similarity

Finding patients who shared similar daily treatment vectors worked well for finding patients of similar types. Due to the nature of the ICU, many patients received a large number of drugs, and using embeddings rather than a one-hot style approach allows the meaningful entries to be more discriminative. We selected patients at random, picked a random day for each patient, and computed the cosine similarity between that daily treatment vector and all other daily treatment vectors for all patients. As expected, other days from the same patient's stay in the ICU rank very highly in many cases. However, even if we look only at other patients, we see meaningful groupings occurring. Some examples are included in Table 5. The similarities are cosine similarities of normalised vectors, and so they vary between 100% and -100%. A similarity of 100% means identical, while -100% indicates complete opposites. As such, we can show example pairs which are similar, and pairs which are not.

Patient 1 Diagnoses | Patient 2 Diagnoses | Treatment Similarity
Newborn | Newborn | 99%
Stab Wounds | Motor Vehicle Accident | 89%
Pneumonia | Hypoxia | 85%
Diarrhea | Fever | 85%
Urinary Tract Infection | Renal Failure | 78%
Encephalopathy | Congestive Heart Failure | -41%
Incarcerated Hernia | Sepsis | -40%
Newborn | Arterial Injury | -35%

Table 5: Similarity between selected daily treatment vectors

As shown in Figure 4, if we apply PCA to the 100-dimensional embeddings to reduce the dimensionality down to two principal components, we can visualise the 'closeness' of a small number of entries. The figure shows that the selected entries fall into 3 or 4 clusters. Diabetes is close to Insulin, while two neonatal drugs (denoted by NEO*) are close to the diagnosis ICD 9 code for electrolyte imbalance, a condition that is strongly associated with newborns in the dataset. The other cluster corresponds to pain medication (and Docusate Sodium, which is used to alleviate constipation, a common side effect of many pain medications). This cluster seems to split into two subclusters, potentially corresponding to differing uses of the different drugs, but this is likely an artifact of the dimensionality reduction.

Figure 4: PCA on selected embeddings
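Both qualitative checks in this section, the nearest-neighbour tables and the two-dimensional PCA projection, are straightforward to reproduce once the embedding matrix has been trained. The sketch below assumes a NumPy array embedding_matrix of shape (vocabulary size, 100); the array shown, its size, and the function names are hypothetical placeholders rather than part of our implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

def nearest_neighbours(embedding_matrix, query_id, k=2):
    """Return indices of the k rows nearest to query_id by cosine similarity."""
    unit = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = unit @ unit[query_id]
    sims[query_id] = -np.inf            # exclude the query itself
    return np.argsort(-sims)[:k]

def project_2d(embedding_matrix, selected_ids):
    """Reduce selected embeddings to two principal components for plotting."""
    return PCA(n_components=2).fit_transform(embedding_matrix[selected_ids])

# Usage with hypothetical data: a random 1000 x 100 embedding matrix.
embedding_matrix = np.random.default_rng(0).normal(size=(1000, 100))
print(nearest_neighbours(embedding_matrix, query_id=5))
print(project_2d(embedding_matrix, selected_ids=[1, 2, 3, 4]).shape)  # (4, 2)
```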
5 Conclusions and Ongoing Work

The idea and initial results presented here are part of our ongoing work on finding compact representations of patients from their electronic hospital records, to make use of the massive deluge of longitudinal data available and to understand the densest set of features representing a given patient. Our long-term goal is to complement endeavors in personalized medicine by devising patient similarity measures that can be used to supplement scientific inquiry and which can translate into actionable knowledge at the bedside. For instance, we would like to be able to answer questions such as: given that patient X is similar to patient Y, will X respond to treatment Z similarly to patient Y? Or, why has patient A responded differently than patient B to the same treatment?

6 Acknowledgements

The authors would like to acknowledge the support of the NIHR Biomedical Research Centre for Mental Health, the Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust and King's College London, and a joint infrastructure grant from Guy's and St Thomas' Charity and the Maudsley Charity, London, United Kingdom.

This research was also supported by researchers at the National Institute for Health Research University College London Hospitals Biomedical Research Centre, and by awards establishing the Farr Institute of Health Informatics Research at UCLPartners, from the Medical Research Council, Arthritis Research UK, British Heart Foundation, Cancer Research UK, Chief Scientist Office, Economic and Social Research Council, Engineering and Physical Sciences Research Council, National Institute for Health Research, National Institute for Social Care and Health Research, and Wellcome Trust.

References

[Bengio and Usunier, 2011] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2764–2770, 2011.

[Brants et al., 2007] T. Brants, C. Popat, F. Xu, J. Och, and J. Dean. Large language models in machine translation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.

[Collobert and Weston, 2008] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In International Conference on Machine Learning (ICML), pages 160–167, 2008.

[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning (ICML), pages 513–520, 2011.

[Johnson et al., 2016] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

[LeCun et al., 2012] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Rumelhart et al., 1988] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[Socher et al., 2011] Richard Socher, Cliff Lin, Andrew Ng, and Christopher Manning. Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML), 2011.

[Turney and Pantel, 2010] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.

[Turney, 2013] Peter Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. Transactions of the Association for Computational Linguistics (TACL), pages 353–366, 2013.