Encoding Medical Ontologies With Holographic Reduced Representations for Transformers

Bing Hu1,∗, Trevor Yu1, Tia Tuinstra1, Ryan Rezai1, Harshit Bokadia1, Rachel DiMaio1, Thomas Fortin1, Brian Vartian1,2 and Bryan Tripp1

1 University of Waterloo, Ontario, Canada
2 McMaster University, Ontario, Canada

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
∗ Corresponding author: bingxu.hu@uwaterloo.ca (B. Hu)

Abstract

Transformer models trained on NLP tasks with medical codes often have randomly initialized embeddings that are then adjusted based on training data. For terms appearing infrequently in the dataset, there is little opportunity to improve these representations and learn semantic similarity with other concepts. Medical ontologies represent many biomedical concepts and define a relationship structure between these concepts, making ontologies a valuable source of domain-specific information. Holographic Reduced Representations (HRR) are capable of encoding ontological structure by composing atomic vectors to create structured higher-level concept vectors. We developed an embedding layer that generates concept vectors for clinical diagnostic codes by applying HRR operations that compose atomic vectors based on the SNOMED CT ontology. This approach allows for learning the atomic vectors while maintaining structure in the concept vectors. We trained a Bidirectional Encoder Representations from Transformers (BERT) model to process sequences of clinical diagnostic codes and used the resulting HRR concept vectors as the embedding matrix for the model. The HRR-based approach introduced interpretable structure into code embeddings while maintaining or modestly improving performance on the masked language modeling (MLM) pre-training task (particularly for rare codes) as well as the fine-tuning tasks of mortality and disease prediction. This approach also better maintains semantic similarity between medically related concept vectors, due to both shared atomic vectors and disentangling of code-frequency information.

Keywords: Deep Learning, Ontology, Knowledge-Integration

1. Introduction

Transformers [1] jointly optimize high-dimensional vector embeddings that represent input tokens, and a network that contextualizes and transforms these embeddings to perform a task. Originally designed for natural language processing (NLP) tasks, transformers are now widely used with other data modalities. In medical applications, one important modality consists of medical codes that are used extensively in electronic health records (EHR). A prominent example in this space is Med-BERT [2], which consumes a sequence of diagnosis codes. Tasks that Med-BERT and other EHR-transformers perform include disease and mortality prediction.

Deep networks have traditionally been alternatives to symbolic artificial intelligence, with different advantages [3]. Deep networks use real-world data effectively, but symbolic approaches have complementary properties, such as better transparency and capacity for incorporating structured information, inspiring many efforts to combine the two approaches in neuro-symbolic systems [4]. Additional transparency and the ability to incorporate structured information are potential benefits of symbolic approaches in medical applications [5]. Standard large language models (LLMs) can be prone to biases in the training data, such as frequency bias, which can result in medical misinformation and potentially clinical harm [6, 7, 8].

Here we use a novel neuro-symbolic medical transformer architecture that incorporates structured knowledge from an authoritative medical ontology into the embeddings. Specifically, we use vector-symbolic holographic reduced representations (HRRs) [9] to produce composite medical-code embeddings and backpropagate through the architecture to optimize the embeddings of atomic concepts. This approach produces optimized medical-code embeddings with an explicit structure that incorporates medical knowledge.

We test our method, Holographic Reduced Representation Bidirectional Encoder Representations from Transformers (HRRBERT), on the Medical Information Mart for Intensive Care (MIMIC)-IV dataset [10] and show improvements in both pre-training and fine-tuning tasks. We also show that our embeddings of ontologically similar rare medical codes have high cosine similarity, in contrast with embeddings that are learned in the standard way. Finally, we investigate learned representations of medical-code frequency, in light of a recent demonstration of frequency bias in EHR-transformers [6].
We contribute:

• A novel neuro-symbolic architecture, HRRBERT, that combines vector-symbolic embeddings with the BERT LLM architecture, leading to better performance in medical tasks.
• Efficient construction of vector-symbolic embeddings that leverages PyTorch autograd on GPUs.
• Optimized medical-code embeddings that better respect the semantic similarity of medical terminology than standard embeddings for infrequently used codes.

We focus here on processing medical codes, but our methods would extend naturally to foundation models that combine medical codes and natural language. Specifically, the trained atomic vectors of our vector-symbolic embeddings could share a dictionary with language embeddings, so that training of each could improve the representation of the other.

1.1. Background and Related Works

The Vector-Symbolic Architectures (VSA) approach is a computing paradigm that relies on high dimensionality and randomness to represent concepts as unique vectors in a high-dimensional space [11]. VSAs create and manipulate distributed representations of concepts by combining base vectors with bundling, binding, and permutation algebraic operators [12]. For example, a scene with a red box and a green ball could be described with the vector SCENE = RED⊗BOX + GREEN⊗BALL, where ⊗ indicates binding and + indicates bundling. The atomic concepts of RED, GREEN, BOX, and BALL are represented by base vectors, which are typically random. VSAs also define an inverse operation that allows the decomposition of a composite representation. For example, the scene representation could be queried as SCENE⊗BOX⁻¹. This should return the representation of RED, or an approximation of RED that is identifiable when compared to a dictionary. In a VSA, the similarity between concepts can be assessed by measuring the distance between the two corresponding vectors.
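As a concrete illustration of these operators, the following sketch (ours, not from the paper) implements binding as circular convolution, as in the HRR scheme adopted below, and queries a composite scene vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

def rand_vec():
    # Base vectors are i.i.d. normal with variance 1/d, so norms are ~1
    return rng.normal(0.0, 1.0 / np.sqrt(d), d)

def bind(a, b):
    # HRR binding: circular convolution, computed in the Fourier domain
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def inverse(a):
    # Approximate inverse (involution): [a[0], a[d-1], a[d-2], ..., a[1]]
    return np.concatenate(([a[0]], a[:0:-1]))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

RED, GREEN, BOX, BALL = (rand_vec() for _ in range(4))
SCENE = bind(RED, BOX) + bind(GREEN, BALL)  # bundling is vector addition

query = bind(SCENE, inverse(BOX))  # SCENE ⊗ BOX^-1
print(cosine(query, RED))    # high (~0.7): RED is recovered, plus noise
print(cosine(query, GREEN))  # near zero
```

The query returns RED plus a noise term (GREEN⊗BALL bound with BOX⁻¹), which is why the result must be cleaned up against a dictionary of known vectors.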
VSAs were proposed to address challenges in modelling cognition, particularly language [12]. However, VSAs have also been successfully applied across a variety of domains and modalities outside of language, including vision [13, 14], biosignal processing [15], and time-series classification [16]. Regardless of the modality or application, VSAs provide value by enriching vectors with additional information, such as spatial semantic information in images and global time encoding in time series.

An early VSA framework was Smolensky's Tensor Product Representation [17], which addressed the need for compositionality but suffered from exploding model dimensionality. The VSA framework introduced by Plate, Holographic Reduced Representations (HRR), improved upon Smolensky's by using circular convolution as the binding operator [9]. Circular convolution keeps the output in the same dimension, solving the problem of exploding dimensionality.

In the field of deep learning, HRRs have been used in previous work to recast self-attention for transformer models [18], to improve the efficiency of neural networks performing a multi-label classification task by using an HRR-based output layer [3], and as a learning model itself with a dynamic encoder that is updated through training [19]. In all of these works, the efficiency and simple arithmetic of HRRs are leveraged. Our work differs in that we also leverage the ability of HRRs to create structured vectors that represent complex concepts as inputs to a transformer model.

VSAs such as HRRs can effectively encode domain knowledge, including complex concepts and the relationships between them. For instance, Nickel et al. [20] propose holographic embeddings that make use of VSA properties to learn and represent knowledge graphs. Encoding domain knowledge is of interest in the field of deep learning, as it could improve, for example, a deep neural network's ability to leverage human knowledge and to communicate its results within a framework that humans understand [21]. Ontologies are a form of domain knowledge incorporated into machine learning models to use background knowledge to create embeddings with meaningful similarity metrics, among other purposes [22]. In our work, we use HRRs to encode domain knowledge in trainable embeddings for a transformer model. The domain knowledge we use comes from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), a widely used clinical ontology that includes definitions of relationships between clinical concepts [23].

To the best of our knowledge, HRRs have not been used before as embeddings for transformer models. Transformer models typically use learned embeddings with random initializations [1]. However, in the context of representing ontological concepts, such unstructured embeddings can have undesirable effects. One problem is the inconsistency between the rate or patterns of co-occurrence of medical concepts and their degree of semantic similarity as described by the ontology. For example, the concepts of "Type I Diabetes" and "Type II Diabetes" are mutually exclusive in EHR data and do not follow the same patterns of occurrence, due to differences in pathology and patient populations [24]. These differences in occurrence make it difficult for a transformer model to learn embeddings with accurate similarity metrics. The concepts should have relatively high similarity according to the ontology: they share the common ancestor "Diabetes Mellitus," they are both metabolic disorders that affect blood glucose levels, and they can both lead to similar health outcomes.

Song et al. [24] seek to address this type of inconsistency by training multiple "multi-sense" embeddings for each non-leaf node in an ontology's knowledge graph via an attention mechanism. However, the "multi-sense" embeddings do not address the learned frequency-related bias that also arises from the co-occurrence of concepts. Frequency-related bias raises an explainability issue, as it leads to learned embeddings that do not reflect true similarity relationships between concepts (for example, as defined in an ontology) but instead reflect the frequency of the concepts in the dataset [6]. This bias particularly affects codes that are used less frequently.

Our proposed approach, HRRBERT, uses the structure from SNOMED CT to represent thousands of concepts with high-dimensional vectors such that each vector reflects a particular clinical meaning and can be compared to other vectors using the HRR similarity metric, cosine similarity. It also leverages the computing properties of HRRs to provide structured embeddings for an LLM that support optimization through backpropagation.
2. Methods

2.1. MIMIC-IV Dataset

The data used in this study were derived from the Medical Information Mart for Intensive Care (MIMIC)-IV v2.0 database, which is composed of de-identified EHRs from in-patient hospital visits between 2008 and 2019 [10]. MIMIC-IV is available through PhysioNet [25]. We used the ICD-9 and ICD-10 diagnostic codes from the icd_diagnosis table of the MIMIC-IV hosp module. We filtered out patients who did not have at least one diagnostic code associated with their records. Sequences of codes were generated per patient by sorting their hospital visits by time. Within one visit, the order of codes from the MIMIC-IV database was used, since it represents the relative importance of the code for that visit. Each unique code was assigned a token. In total, there were 189,980 patient records in the dataset. We used 174,890 patient records for pre-training, on which we performed a 90–10 training-validation split. We reserved 15k records for fine-tuning tasks.
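A minimal sketch of this preprocessing (ours, not from the paper), assuming the hosp-module tables have been exported to CSV; the column names follow the MIMIC-IV schema, but the exact file layout is an assumption:

```python
import pandas as pd

# diagnoses_icd: subject_id, hadm_id, seq_num, icd_code, icd_version
# admissions:    subject_id, hadm_id, admittime
dx = pd.read_csv("diagnoses_icd.csv")
adm = pd.read_csv("admissions.csv", parse_dates=["admittime"])

dx = dx.merge(adm[["hadm_id", "admittime"]], on="hadm_id")

# Visits in temporal order; within a visit, keep the recorded order (seq_num),
# which reflects the relative importance of each code for that visit
dx = dx.sort_values(["subject_id", "admittime", "seq_num"])

# One token per unique (code, version) pair, one sequence per patient
dx["token"] = dx["icd_code"] + "-" + dx["icd_version"].astype(str)
sequences = dx.groupby("subject_id")["token"].agg(list)
sequences = sequences[sequences.str.len() > 0]  # keep patients with >= 1 code
```

The code-version token format ("2724-9", "K219-10") matches the code identifiers used in the case studies later in the paper.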
2.2. Model Architecture

We utilized a BERT-base model architecture with post-layer normalization and a sequence length of 128 ICD codes [26]. A custom embedding class was used to support the functionality required for our HRR embeddings. We adapted the BERT segment embeddings to represent groups of codes from the same hospital visit, using up to 100 segment embeddings to encode visit sequencing. An embedding dimension of d = 768 was used, and all embeddings were initialized from x ~ N_d(0, 0.02), as in [26], including the atomic vectors for HRR embeddings. Fine-tuning used a constant learning rate schedule with a weight decay of 4e-6. Fine-tuning lasted 10 epochs with a batch size of 80.

2.3. Encoding SNOMED Ontology with HRR Embeddings

In this section, we detail the methodology for constructing vector embeddings for ICD disease codes using HRR operations based on the SNOMED CT structured clinical vocabulary. We first describe our mapping from ICD concepts to SNOMED CT terms. Next, we define how the atomic symbols present in the SNOMED CT ontology are combined using HRR operations to construct concept vectors for the ICD codes. Finally, we describe our method to efficiently compute the HRR embedding matrix using default PyTorch operations that are compatible with autograd.

2.3.1. Mapping ICD to SNOMED CT Ontology

Our data use ICD-9 and ICD-10 disease codes, while our symbolic ontology is defined in SNOMED CT, so we required a mapping from ICD to the SNOMED CT system to build our symbolic architecture. We used the SNOMED CT International Release from May 31, 2022 [23] and only included SNOMED CT terms that were active at the time of that release. While SNOMED publishes a mapping tool from SNOMED CT to ICD-10, a majority of ICD-10 concepts have one-to-many mappings in the ICD-to-SNOMED CT direction [27]. To increase the fraction of one-to-one mappings, we used additional published mappings from the Observational Medical Outcomes Partnership (OMOP) [28], mappings from ICD-9 directly to SNOMED CT [29], and mappings from ICD-10 to ICD-9 [30].

Notably, after excluding ICD codes with no active SNOMED CT mapping, 671 of the 26,164 unique ICD codes in the MIMIC-IV dataset were missing mappings. When those individual codes were removed, 4.62% of the data volume of codes was lost. This removed 58 of 190,180 patients from the dataset, as they had no valid ICD codes in their history. Overall, the remaining 25,493 ICD codes mapped to a total of 12,263 SNOMED CT terms.

2.3.2. SNOMED CT Vector-Symbolic Architecture

Next, we define how the contents of the SNOMED CT ontology were used to construct a symbolic graph to represent ICD concepts. For a given SNOMED CT term, we used its descriptive words and its relationships to other SNOMED CT terms. A relationship is defined by a relationship type and a target term. In total, 13,852 SNOMED CT target terms and 40 SNOMED CT relationship types were used to represent all desired ICD concepts. In the ontology, many ICD concepts share SNOMED CT terms in their representations.

The set of relationships was not necessarily unique for each SNOMED CT term. To add more unique information, we used a term's "fully specified name" and any "synonyms" as an additional set of words describing that term. We set all text to lowercase, stripped punctuation, and split on spaces to create a vocabulary of words. We removed common English stopwords using a custom stopword list that was collected with the assistance of a medical physician. The procedure resulted in a total of 8,833 vocabulary words.

Overall, there were a total of 22,725 "atomic" symbols for the VSA, which included the SNOMED CT terms, relationships, and the description vocabulary. Each symbol was assigned an "atomic vector". We built a "concept vector" for each of the target 25,493 ICD codes using HRR operations to combine atomic vectors according to the SNOMED CT ontology structure.

To build a d-dimensional concept vector for a given ICD concept, we first considered the set of all relationships that the concept maps to. We used the HRR operator for binding, circular convolution, to combine the vectors representing the relationship type and destination term, and defined the concept vector to be the bundling of these bound relationships. For the description words, we bundled the vectors representing each word together and bound the result with a new vector representing the relationship type "description," as shown in Equation 1.

    x_ICD concept = ∑_SNOMED CT (x_rel ⊛ x_term) + ∑_words (x_desc ⊛ x_word)    (1)

Formally, let 𝔸 = {1, 2, ..., N_a} be the set of integers enumerating the unique atomic symbols for SNOMED CT terms and description words. Let 𝔹 = {1, 2, ..., N_r} be the set of integers enumerating the unique relationships for SNOMED CT terms, including the description relationship and the binding identity. Let 𝔻 = {1, 2, ..., N_c} be the set of integers enumerating the ICD-9 and ICD-10 disease concepts represented by the VSA.

𝔸 has an associated embedding matrix A ∈ ℝ^(N_a×d), where atomic vector a_k = A[k,:], k ∈ 𝔸, is the k-th row of the embedding matrix. Similarly, there is a relationship embedding matrix R ∈ ℝ^(N_r×d) with r_j = R[j,:], j ∈ 𝔹, and an ICD concept embedding matrix C ∈ ℝ^(N_c×d) with c_i = C[i,:], i ∈ 𝔻. We describe the VSA with the formula in Equation 2, where 𝒢_i is a graph representing the connections from ICD concept i to atomic symbols k by relationships j.

    c_i = ∑_{(j,k)∈𝒢_i} r_j ⊛ a_k    (2)

Additional details on how to efficiently use PyTorch autograd to learn through these HRR operations are provided in Appendix A.1.
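The following sketch (ours) shows a direct, per-concept implementation of Equation 2 in PyTorch, with binding computed as a pointwise product of Fourier spectra; Appendix A.1 describes the vectorized variant the paper actually uses, which is much faster:

```python
import torch

def build_concept_embeddings(A, R, graphs):
    """A: (N_a, d) atomic vectors; R: (N_r, d) relationship vectors;
    graphs[i]: list of (j, k) edges of graph G_i for ICD concept i."""
    d = A.shape[1]
    A_f = torch.fft.rfft(A, dim=1)  # spectra of atomic vectors
    R_f = torch.fft.rfft(R, dim=1)  # spectra of relationship vectors
    rows = []
    for edges in graphs:
        # Binding r_j ⊛ a_k is a spectral product; bundling is the sum
        spec = sum(R_f[j] * A_f[k] for j, k in edges)
        rows.append(torch.fft.irfft(spec, n=d))
    return torch.stack(rows)  # (N_c, d) concept embedding matrix C

# A and R can be nn.Parameter tensors: gradients flow through the FFTs,
# so training the concept vectors trains the underlying atomic vectors.
A = torch.nn.Parameter(0.02 * torch.randn(22725, 768))  # sizes from the paper
R = torch.nn.Parameter(0.02 * torch.randn(42, 768))     # 40 types + description + identity
C = build_concept_embeddings(A, R, [[(0, 1), (1, 2)], [(0, 1)]])  # toy graphs
```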
2.3.3. Embedding Configurations

We call our method of constructing embeddings for ICD codes purely from HRR representations "HRRBase" and the standard method of creating transformer token embeddings from random vectors "unstructured". While the HRRBase configuration enforces the ontology structure, we wondered whether it would be too rigid and have difficulty representing information not present in SNOMED CT. As dataset frequency information for ICD medical codes is not present in the HRR structure, we tried adding an embedding that represented the empirical frequency of each ICD code in the dataset. We also tried adding fully learnable embeddings with no prior structure.

Given the wide range of ICD code frequencies in MIMIC, we log-transformed the empirical ICD code frequencies and then discretized the resulting range. For our HRRFreq configuration, we used the sinusoidal frequency encoding from [1] to encode the discretized log-frequency information. The frequency embeddings were normalized before being summed with the HRR embedding vectors.

We defined two additional configurations in which a standard embedding vector was integrated with the structured HRR concept vector. With "HRRAdd", a learnable embedding was added to the concept embedding: HRRAdd = C + L_add, where L_add ∈ ℝ^(N_c×d). However, this roughly doubled the number of learnable parameters compared to other formulations.

With "HRRCat", a learnable embedding of dimension d/2 was concatenated with an HRR concept embedding of dimension d/2. This keeps the total number of learnable parameters roughly the same as in the unstructured configuration (25,493 d-dimensional vectors) and the HRRBase configuration (22,725 d-dimensional vectors). The final embedding matrix was defined as HRRCat = [C L_cat], where C, L_cat ∈ ℝ^(N_c×d/2).
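A sketch (ours) of how the four configurations could be assembled from an HRR concept matrix C; the sinusoidal encoding reuses the position-encoding formula of [1] on discretized log-frequency bins, and all variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def sinusoidal(bins, d):
    # Sinusoidal encoding as in [1], applied to discretized log-frequency bins
    pos = bins.float().unsqueeze(1)            # (N_c, 1)
    i = torch.arange(0, d, 2).float()          # (d/2,)
    angles = pos / (10000.0 ** (i / d))        # (N_c, d/2)
    enc = torch.zeros(len(bins), d)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return F.normalize(enc, dim=1)             # normalized before summing

N_c, d = 25493, 768
C = torch.randn(N_c, d)              # stands in for the Eq. 2 output
bins = torch.randint(0, 8, (N_c,))   # discretized log-frequencies (illustrative)

E_base = C                                          # HRRBase
E_freq = C + sinusoidal(bins, d)                    # HRRFreq
L_add = torch.nn.Parameter(0.02 * torch.randn(N_c, d))
E_add = C + L_add                                   # HRRAdd: ~2x learnable params
L_cat = torch.nn.Parameter(0.02 * torch.randn(N_c, d // 2))
# HRRCat: in the paper the HRR half is built directly at dimension d/2;
# slicing C here is only for illustration.
E_cat = torch.cat([C[:, : d // 2], L_cat], dim=1)
```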
2.4. Experiments

We pre-trained the unstructured, HRRBase, HRRCat, and HRRAdd embedding configurations of HRRBERT on the masked language modelling (MLM) task, for 3 trials each. For each of the 3 pre-trained models, 10 fine-tuning trials were conducted, for a total of 30 trials per fine-tuning task. The best checkpoint from the 10 epochs of fine-tuning was saved based on validation performance. A test set containing 666 patient records was used to evaluate each of the fine-tuned models for both mortality and disease prediction. We report accuracy, precision, recall, and F1 scores averaged over the 30 trials for the fine-tuning tasks.

3. Experimental Results

3.1. Pre-training

MLM accuracy was evaluated on a validation set over the course of pre-training. Pre-training results for the different configurations are shown in Figure 1. The pre-training results are averaged over 3 runs for each configuration except HRRFreq, for which only 1 model run was completed.

[Figure 1: Pre-training validation-set evaluation results for different configurations.]

The baseline of learned unstructured embeddings has a peak pre-training validation performance of around 33.4%. HRRBase embeddings perform around 17% worse than this baseline. We hypothesize that this decrease in performance is due to a lack of embedded frequency information in HRRBase compared to learned unstructured embeddings. HRRFreq (which combines SNOMED CT information with frequency information) performs similarly to unstructured embeddings, supporting this hypothesis. Compared to baseline, HRRAdd and HRRCat improve pre-training performance by a modest margin of around 2%. We posit that the almost 20% increase in performance of HRRCat and HRRAdd over HRRBase during pre-training is partly due to the fully learnable embeddings in HRRCat and HRRAdd learning frequency information.

3.2. Fine-tuning

We fine-tuned the networks for mortality prediction and disease prediction. Across metrics and tasks, the best results were often seen with HRRBase (Table 1), with some differences being statistically significant.

3.2.1. Mortality Prediction Task

The mortality prediction task is defined as predicting patient mortality within 6 months after the last visit. Binary mortality labels were generated by comparing the time difference between the last visit and the mortality date. A training set of 13k patient records along with a validation set of 2k patient records were used to fine-tune each model on mortality prediction. Table 1 shows the evaluation results of mortality prediction for each of the configurations. We performed a two-sided Dunnett's test to compare our multiple experimental HRR embedding configurations to the control unstructured embeddings, at a p < 0.05 significance level. HRRBase embeddings had a significantly greater mean F1-score (p = 0.043) and precision (p = 0.042) compared to unstructured embeddings.

3.2.2. Disease Prediction Task

The disease prediction task is defined as predicting which disease chapters were recorded in the patient's last visit, using information from earlier visits. We converted all ICD codes in a patient's last visit into a multi-label binary vector of disease chapters. As there are 22 disease chapters defined in ICD-10, the multi-label binary vector has a size of 22, with binary values corresponding to the presence of a disease in each chapter. A training set of 4.5k patient records along with a validation set of 500 patient records were used to fine-tune each model on this task. Table 1 shows the evaluation results of disease prediction for each of the configurations. For the two-sided Dunnett's test, Levene's test shows that the equal-variance condition is satisfied, and the Shapiro-Wilk test suggests normal distributions except for HRRAdd accuracy. The test showed that HRRBase embeddings had a significantly greater mean accuracy (p = 0.033) and precision (p = 0.023) compared to unstructured embeddings. No other mean metrics for HRR embeddings were significantly greater than the control.
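A sketch (ours) of the multi-label target construction for this task; `chapter_of` is a hypothetical lookup from an ICD code to its chapter index 0–21:

```python
import torch

def disease_target(last_visit_codes, chapter_of):
    """Map the codes of a patient's last visit to a 22-dim binary chapter vector."""
    y = torch.zeros(22)
    for code in last_visit_codes:
        y[chapter_of(code)] = 1.0  # multi-label: any number of chapters may be set
    return y

# Toy lookup for illustration; a real one follows the ICD-10 chapter ranges
y = disease_target(["K219-10", "E8889-9"], lambda c: hash(c) % 22)
```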
3.2.3. eICU Mortality Prediction

An additional experiment conducted on the Philips Electronic Intensive Care Unit (eICU) database [31] corroborates the MIMIC-IV results. For this experiment, we applied our mortality prediction models that were fine-tuned on MIMIC-IV to eICU data to see whether our results generalize. Table 1 shows that HRRBase embeddings had a significantly greater mean accuracy (p = 0.046) compared to unstructured embeddings when applied to the eICU dataset. These models are not optimized for mortality prediction at other hospitals, where coding methodology and clinical practice may differ. For example, the most common code in the eICU dataset represents acute respiratory failure, whereas the most common code in the MIMIC-IV dataset represents hypertension.

Table 1: Fine-tuning mean test scores and standard deviations for mortality prediction, disease prediction, eICU mortality prediction, and both Really-Out-Of-Distribution (ROOD) Unseen and Overall disease prediction tasks. The best scores are bolded, and underlined if statistically significant, in the original typeset table.

Task                  Configuration  Accuracy  Precision  Recall    F1-Score
ROOD Unseen           HRRBase        94.9±1.0  83.5±4.6   76.8±5.1  79.5±4.9
                      Unstructured   92.3±0.3  46.2±0.0   50.0±0.1  48.0±0.1
ROOD Overall          HRRBase        81.9±0.1  78.3±0.3   75.2±0.8  76.4±0.5
                      Unstructured   81.9±0.2  78.7±0.7   74.4±1.2  76.0±0.8
Mortality Prediction  HRRBase        84.4±2.3  65.8±2.0   85.6±2.2  69.2±2.7
                      HRRAdd         84.0±2.2  65.7±1.9   85.7±2.3  68.9±2.5
                      HRRCat         83.9±2.3  65.6±1.7   84.9±2.8  68.8±2.5
                      Unstructured   83.4±1.9  64.9±1.2   84.6±2.2  67.9±1.8
Disease Prediction    HRRBase        79.9±0.5  73.0±1.2   67.2±0.7  69.0±0.6
                      HRRAdd         79.6±0.7  72.6±1.4   67.3±0.9  69.0±0.6
                      HRRCat         79.6±0.8  72.5±1.7   67.3±1.0  68.9±0.8
                      Unstructured   79.4±0.5  72.1±1.1   67.8±1.0  69.2±0.7
eICU Mortality        HRRBase        68.9±1.3  75.0±1.8   57.0±5.8  64.5±3.5
Prediction            HRRAdd         68.1±1.6  74.0±2.2   56.2±6.8  63.6±3.9
                      HRRCat         68.2±1.2  73.8±2.6   57.0±7.2  64.0±3.7

3.2.4. Really-Out-Of-Distribution (ROOD) Disease Prediction

We conducted an additional disease-prediction experiment to test generalization to patients with codes outside the training distribution. We found six patients whose records consisted of only 32 codes between them (see the list of codes in Appendix A). We created a really-out-of-distribution (ROOD) dataset that consisted of all patients in MIMIC-IV (nearly 30K) with at least one of these codes, and used it as a validation set. The separate pre-training and fine-tuning datasets did not contain these codes. We also created a smaller validation dataset consisting of the six patients with only these codes. During pre-training, the HRRBase and unstructured models did not encounter any examples using the 32 ROOD codes and so did not explicitly learn representations for those codes. The trained models were then tested on the ROOD dataset.

The ROOD disease-prediction results in Table 1 show that HRRBase outperforms the unstructured embedding model in contexts of entirely unseen codes. We assess statistical significance using a two-tailed, independent t-test with unequal variance, as some measurements failed Levene's test for equal variance. The means of all metrics for HRRBase are significantly greater than for unstructured embeddings when making inferences on patients with entirely unseen codes (p < 0.001 for all metrics). Given the embedded ontological structure, we hypothesize that HRRBase implicitly learns useful embeddings for the 32 unseen ROOD codes by learning shared embedding components of the VSA when training on other codes. Unstructured embeddings cannot learn better representations for codes never seen in training.

3.3. t-SNE of Frequency Bias

We computed t-SNE dimension reductions to visualize relationships among ICD code embeddings in the pre-trained models.

[Figure 2: Comparing t-SNE of (a) unstructured embeddings, (b) HRRAdd, (c) HRRCat, and (d) HRRBase. The t-SNE graphs are color-coded by the frequency of the ICD codes in the dataset: highly frequent codes are colored blue, while infrequent codes are colored red.]
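Such a visualization can be produced along these lines (our sketch; `C` is a pre-trained embedding matrix as a NumPy array and `counts` the per-code corpus frequencies, both assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the (N_c, d) embedding matrix to 2-D
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(C)

# Reversed coolwarm: high log-frequency maps to blue, low to red
plt.scatter(xy[:, 0], xy[:, 1], c=np.log(counts), cmap="coolwarm_r", s=3)
plt.colorbar(label="log code frequency")
plt.show()
```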
Figure 2 shows that unstructured embeddings of common ICD codes are clustered together, with a large separation from those of uncommon codes. This suggests that code-frequency information is prominently represented in these embeddings, consistent with frequency bias in related models [6]. Common and uncommon code clusters are less distinct in HRRBase, which does not explicitly encode frequency information.

As shown in Figure 1, adding code-frequency information to the structured HRRBase embeddings (i.e., the HRRFreq embeddings) improved the pre-training loss to be similar to that of unstructured embeddings. This suggests that the unstructured components in HRRAdd and HRRCat may have learned some frequency information, since their losses are also similar to the loss of models with unstructured embeddings. To investigate whether this occurred, we performed t-SNE dimension reductions of the unstructured components of HRRAdd and HRRCat and colored the points by code frequency, as shown in Figure 3. This graph suggests that these additional unstructured embeddings do learn some frequency information, given the clustering of high-frequency codes. However, the frequency information learned by the HRRCat and HRRAdd learnable embeddings influences the overall embeddings less strongly than in unstructured embeddings, as seen in Figure 2, where low-frequency embeddings are less distinctly separated from higher-frequency embeddings.

[Figure 3: t-SNE representation of sinusoidal frequency embeddings (left), and unstructured embedding components of HRRAdd (middle) and HRRCat (right).]

3.4. Top-k Accuracy for MLM

Accurately predicting infrequently used disease codes is an important, clinically relevant task. Given that the model sees more common codes than rare codes during training, rare codes are naturally challenging to predict. Based on the promising empirical results on out-of-distribution mortality prediction for eICU and disease prediction on ROOD, we hypothesized that our HRR embedding models should have improved accuracy when predicting rare codes compared to unstructured embedding models, since rare codes should share some atomic vectors in their representations with common codes.

To test this, we evaluated the accuracy of an MLM pre-trained model predicting a single masked code of a known frequency. We split the codes in the pre-training validation dataset into 7 bins from log frequency -14 to 0, such that each bin has a width of 2. The most common codes are in the bin with log frequencies between -2 and 0, while the rarest codes are in the bin with log frequencies between -14 and -12. From each bin, we selected 400 codes at random, repeating codes from that bin if there were fewer than 400. For each of these codes, we selected one patient who had that code in their history, masked that code as would be done in MLM, and created a dataset of these 2,800 patients to use for MLM inference.
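The probe set can be constructed roughly as follows (our sketch; `counts` maps each code to its occurrence count in the pre-training corpus, and the log base is an assumption, chosen so that the rarest codes fall near -14):

```python
import numpy as np

rng = np.random.default_rng(0)
codes = np.array(list(counts))
freq = np.array([counts[c] for c in codes], dtype=float)
logf = np.log(freq / freq.sum())   # log relative frequency of each code

probe = []
edges = np.arange(-14, 1, 2)       # 7 bins of width 2 over [-14, 0]
for lo, hi in zip(edges[:-1], edges[1:]):
    members = codes[(logf >= lo) & (logf < hi)]
    # 400 codes per bin, repeating codes when the bin holds fewer than 400
    picks = rng.choice(members, size=400, replace=len(members) < 400)
    probe.extend(picks)            # 7 x 400 = 2,800 masked-code probes
```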
Figure 4 and Figure 5, respectively, show the MLM top-10 and top-100 accuracy for predicting codes in the different frequency bins, averaged across the three pre-training models per configuration. We assess statistical significance for each bin using a two-tailed Dunnett's test comparing the mean accuracy scores of the experimental HRR configurations against the control unstructured configuration; significant comparisons at the p < 0.05 level are indicated with asterisks.

[Figure 4: The top-10 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. Significance levels of 0.05, 0.01, and 0.001 relative to unstructured embeddings are indicated with 1, 2, and 3 asterisks, respectively. Note that HRRBase is expected to perform poorly in this test due to its lack of code-frequency information.]

[Figure 5: The top-100 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. Significance levels of 0.05, 0.01, and 0.001 relative to unstructured embeddings are indicated with 1, 2, and 3 asterisks, respectively.]

Notably, the top-100 accuracy in frequency bin -12 is non-zero for the HRR methods. The codes in this rarest bin occur only once in the dataset and therefore have never been used by the model for gradient updates, since they are in the validation dataset. This suggests that the HRR methods have some ability to provide clinically relevant information about rare codes. However, accuracy for the rarest codes remains too low to be of practical value, perhaps due to limited overlap of these codes' atomic vectors with those of more common codes.
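Top-k accuracy at a masked position can be computed as in the following sketch (ours, assuming a HuggingFace-style masked-LM interface whose output exposes `.logits`):

```python
import torch

@torch.no_grad()
def top_k_accuracy(model, input_ids, masked_pos, true_ids, k=10):
    # logits: (batch, seq_len, vocab); pick the masked position of each row
    logits = model(input_ids=input_ids).logits
    rows = torch.arange(len(true_ids))
    topk = logits[rows, masked_pos].topk(k, dim=-1).indices   # (batch, k)
    hits = (topk == true_ids.unsqueeze(1)).any(dim=1)         # true code in top-k?
    return hits.float().mean().item()
```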
3.5. Medical Code Case Study

Table 2 shows case studies for the codes Other and unspecified hyperlipidemia (2724-9), Hypothermia (9916-9), and Gastro-esophageal reflux disease without esophagitis (K219-10). In the first case study, for 2724-9, we observe that highly ontologically similar codes, such as Other hyperlipidemia and Hyperlipidemia, unspecified, are encoded with high cosine similarity by HRRBase, which is not the case for unstructured embeddings. The co-occurrence problem can be seen in the second case study, for 9916-9: the most similar codes for HRRBase are medically similar codes that would not usually co-occur, while for unstructured embeddings the most similar codes co-occur frequently. In the final case study, for K219-10, frequency-related bias can be observed in the unstructured embeddings, where frequent but mostly ontologically unrelated codes appear in the top list of cosine-similar codes, whereas the top list for HRRBase contains medically similar codes.

Table 2: Three cosine-similarity case studies looking at related ICD codes for unstructured and HRRBase embeddings. The top 4 cosine-similar ICD codes to the chosen code are listed (most to least similar) with their full description and similarity value.

2724-9 - Other and unspecified hyperlipidemia
  Unstructured: Pure hypercholesterolemia (0.542); Hyperlipidemia, unspecified (0.482); Esophageal reflux (0.304); Anemia, unspecified (0.279)
  HRRBase: Other hyperlipidemia (1.000); Hyperlipidemia, unspecified (1.000); Pure hypercholesterolemia (0.463); Mixed hyperlipidemia (0.418)

9916-9 - Hypothermia
  Unstructured: Frostbite of hand (0.418); Frostbite of foot (0.361); Drowning and nonfatal submersion (0.352); Immersion foot (0.341)
  HRRBase: Hypothermia, initial encounter (0.794); Hypothermia not with low env. temp. (0.592); Effect of reduced temp., initial encounter (0.590); Other specified effects of reduced temp. (0.590)

K219-10 - Gastro-esophageal reflux disease without esophagitis
  Unstructured: Esophageal reflux (0.565); Hyperlipidemia, unspecified (0.335); Anxiety disorder, unspecified (0.332); Essential (primary) hypertension (0.326)
  HRRBase: Esophageal reflux (0.635); Gastro-eso. reflux d. with esophagitis (0.512); Reflux esophagitis (0.512); Hypothyroidism, unspecified (0.268)

We broadened this case study to test statistical differences in cosine and semantic embedding similarity between structured and unstructured embeddings. 30 ICD codes were selected from different frequency categories in the dataset: 10 codes drawn randomly from the 300 most common codes, 10 codes drawn randomly by weighted frequency from codes appearing fewer than 30 times in the dataset, and 10 codes drawn randomly by weighted frequency from the entire dataset. For each selected code, the top 4 cosine-similar ICD codes were assessed by a physician for ontological similarity.

For each frequency category, a one-tailed Fisher's exact test was conducted to determine whether a relationship existed between embedding type and clinical relatedness. We found that the results for the rare codes were statistically significant, with p = 2.44×10⁻⁸. With 10 rare codes and the top 4 cosine-similar ICD codes selected for each, there are 40 top cosine-similar codes in total. For unstructured embeddings, only 4 of the top 40 cosine-similar codes were deemed strongly ontologically related by our physician, with the remaining codes deemed less related or unrelated. For our structured HRRBase embeddings, 28 of the top 40 cosine-similar codes were deemed strongly ontologically related, with the remaining codes deemed less related or unrelated. This suggests that knowledge-integrated structured embeddings are associated with greater clinical relevance of the top cosine-similar codes than unstructured embeddings for rare codes, where little training data exists.
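The nearest-neighbour lists in Table 2 correspond to the following computation (our sketch; `C` is the embedding matrix of a pre-trained model):

```python
import torch
import torch.nn.functional as F

def top_similar(C, i, k=4):
    # Cosine similarity of code i's embedding to every code's embedding
    sims = F.cosine_similarity(C[i].unsqueeze(0), C, dim=1)
    sims[i] = -float("inf")      # exclude the query code itself
    values, indices = sims.topk(k)
    return values, indices       # top-k similarities and the matching code ids
```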
4. Discussion

Transformers have leading performance in many applications, but their internal processes are opaque, emerging from enormous parameter sets and data volumes beyond human experience. It is hard to know when they can be trusted; for example, generative transformers are prone to subtle confabulations. Transformers have a general-purpose architecture that performs as well in vision and other modalities as in language. They are a culmination of a key trend in artificial intelligence, away from problem-specific engineering and toward massive data and computation. This trend is justified in terms of performance. However, given two models with equal performance, the one with more explicit conceptual structure is preferable in terms of trust and explainability.

The work presented here is a step in this direction, with our HRRBase embeddings that have explicit conceptual structure and perform equivalently to or better than typical transformer embeddings. The benefit of structured embeddings becomes more pronounced for tasks that involve codes that are rare or not present in training data. HRR embeddings can also be relied on to represent medical meaning rather than co-occurrence in the training data. They also untangle the representation of code frequency, so that it can be included or not, and its effects on decisions understood. Importantly, despite this additional structure, the embeddings are thoroughly learned, suggesting that the approach will be consistent with high performance beyond the examples we have studied.

As our method scales with and leverages PyTorch autograd in the construction of the vector-symbolic embeddings, it is compatible with existing medical LLM architectures as an embedding component capable of encoding domain knowledge.

Future work could explore the potential of these structured embeddings for explaining and controlling the observed frequency bias. As HRRs can be queried with linear operations, future work could also explore whether transformers can learn to extract specific information from these composite embeddings. Limitations to address in future work include the complexity of processing knowledge graphs to be compatible with HRRs. Another important limitation is that our method relies on rare-code HRRs sharing atomic elements with common-code HRRs; however, in SNOMED CT, rare codes are likely to contain some rare atomic elements. To address this point, in addition to SNOMED CT, knowledge could be encoded from sources such as pre-trained medical embeddings, different medical ontologies, and other medical domain knowledge to further improve our proposed methodology. In LLMs that process both medical codes and text, it would make sense to share word embeddings between modalities. This would allow training of each modality to benefit from training of the other, and may help to align the representations of codes and text.

5. Conclusion

We proposed a novel hybrid neural-symbolic approach called HRRBERT that integrates medical ontologies represented by HRR embeddings. In tests with the MIMIC-IV dataset, HRRBERT models modestly outperformed baseline models with unstructured embeddings on pre-training, disease prediction accuracy, mortality prediction F1, and fine-tuning tasks involving infrequently seen codes. HRRBERT models had pronounced performance advantages in MLM with rare codes and in disease prediction for patients with no codes seen during training (ROOD Unseen in Table 1). We also showed that HRRs can be used to create medical code embeddings that better respect ontological similarities for rare codes. A key benefit of our approach is that it facilitates explainability by disentangling token-frequency information, which is prominently represented but implicit in unstructured embeddings.
References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[2] L. Rasmy, Y. Xiang, Z. Xie, C. Tao, Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, npj Digital Medicine 4 (2021) 86. doi:10.1038/s41746-021-00455-y.
[3] A. Ganesan, H. Gao, S. Gandhi, E. Raff, T. Oates, J. Holt, M. McLean, Learning with holographic reduced representations, CoRR abs/2109.02157 (2021). URL: https://arxiv.org/abs/2109.02157.
[4] M. K. Sarker, L. Zhou, A. Eberhart, P. Hitzler, Neuro-symbolic artificial intelligence, AI Communications 34 (2021) 197–209.
[5] S. Ramgopal, L. N. Sanchez-Pinto, C. M. Horvat, M. S. Carroll, Y. Luo, T. A. Florin, Artificial intelligence-based clinical decision support in pediatrics, Pediatric Research 93 (2023) 334–341.
[6] T. Yu, T. Tuinstra, B. Hu, R. Rezai, T. Fortin, R. DiMaio, B. Vartian, B. Tripp, Frequency bias in MLM-trained BERT embeddings for medical codes, CMBES Proceedings 45 (2023). URL: https://proceedings.cmbes.ca/index.php/proceedings/article/view/1050.
[7] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al., Pythia: A suite for analyzing large language models across training and scaling, in: International Conference on Machine Learning, PMLR, 2023, pp. 2397–2430.
[8] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.
[9] T. Plate, Holographic reduced representations, IEEE Transactions on Neural Networks 6 (1995) 623–641. doi:10.1109/72.377968.
[10] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, R. Mark, MIMIC-IV (version 2.0), 2022. URL: https://doi.org/10.13026/7vcr-e114. doi:10.13026/7vcr-e114.
[11] P. Kanerva, Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors, Cognitive Computation 1 (2009) 139–159.
[12] R. W. Gayler, Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience, 2004. arXiv:cs/0412059.
[13] P. Neubert, S. Schubert, Hyperdimensional computing as a framework for systematic aggregation of image descriptors, 2021. arXiv:2101.07720. doi:10.48550/arXiv.2101.07720.
[14] P. Neubert, S. Schubert, K. Schlegel, P. Protzel, Vector semantic representations as descriptors for visual place recognition, in: Robotics: Science and Systems XVII, Robotics: Science and Systems Foundation, 2021. URL: http://www.roboticsproceedings.org/rss17/p083.pdf. doi:10.15607/RSS.2021.XVII.083.
[15] A. Rahimi, P. Kanerva, L. Benini, J. M. Rabaey, Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals, Proceedings of the IEEE 107 (2019) 123–143. doi:10.1109/JPROC.2018.2871163.
[16] K. Schlegel, P. Neubert, P. Protzel, HDC-MiniROCKET: Explicit time encoding in time series classification with hyperdimensional computing, 2022. arXiv:2202.08055. doi:10.48550/arXiv.2202.08055.
[17] P. Smolensky, Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159–216. doi:10.1016/0004-3702(90)90007-M.
[18] M. M. Alam, E. Raff, S. Biderman, T. Oates, J. Holt, Recasting self-attention with holographic reduced representations, 2023. arXiv:2305.19534.
[19] J. Kim, H. Lee, M. Imani, Y. Kim, Efficient hyperdimensional learning with trainable, quantizable, and holistic data representation, in: 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2023, pp. 1–6. doi:10.23919/DATE56975.2023.10137134.
[20] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs, 2015. arXiv:1510.04935. doi:10.48550/arXiv.1510.04935.
[21] T. Dash, A. Srinivasan, L. Vig, Incorporating symbolic domain knowledge into graph neural networks, Machine Learning 110 (2021) 1609–1636. doi:10.1007/s10994-021-05966-z.
[22] M. Kulmanov, F. Z. Smaili, X. Gao, R. Hoehndorf, Semantic similarity and machine learning with ontologies, Briefings in Bioinformatics 22 (2020) bbaa199. doi:10.1093/bib/bbaa199.
[23] V. Riikka, V. Anne, P. Sari, Systematized Nomenclature of Medicine-Clinical Terminology (SNOMED CT) clinical use cases in the context of electronic health record systems: Systematic literature review, JMIR Medical Informatics (2023). doi:10.2196/43750.
[24] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. M. Fung, J. Poon, Medical concept embedding with multiple ontological representations, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2019, pp. 4613–4619. doi:10.24963/ijcai.2019/641.
[25] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, H. E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals, Circulation 101 (2000) e215–e220. doi:10.1161/01.CIR.101.23.e215.
[26] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). arXiv:1810.04805.
[27] NLM, SNOMED CT to ICD-10-CM map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
[28] OHDSI, OHDSI standardized vocabularies, 2019. URL: https://github.com/OHDSI/Vocabulary-v5.0/wiki.
[29] NLM, ICD-9-CM diagnostic codes to SNOMED CT map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
[30] NCHS, Diagnosis code set general equivalence mappings, 2018. URL: https://ftp.cdc.gov/pub/health_statistics/nchs/Publications/ICD10CM/2018/Dxgem_guide_2018.pdf.
[31] T. J. Pollard, A. E. W. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, O. Badawi, The eICU Collaborative Research Database, a freely available multi-center database for critical care research, Scientific Data 5 (2018) 1–13.
A. List of 32 ROOD Codes

The following is the list of 32 ROOD codes:

1. G248-10: Other dystonia
2. E8498-9: Accidents occurring in other specified places
3. E9688-9: Assault by other specified means
4. Z681-10: Body mass index (BMI) 19.9 or less, adult
5. 30550-9: Opioid abuse, unspecified
6. R262-10: Difficulty in walking, not elsewhere classified
7. E887-9: Fracture, cause unspecified
8. R471-10: Dysarthria and anarthria
9. 9916-9: Hypothermia
10. E9010-9: Accident due to excessive cold due to weather conditions
11. F10129-10: Alcohol abuse with intoxication, unspecified
12. E8499-9: Accidents occurring in unspecified place
13. R636-10: Underweight
14. 920-9: Contusion of face, scalp, and neck except eye(s)
15. R4182-10: Altered mental status, unspecified
16. 95901-9: Head injury, unspecified
17. 78097-9: Altered mental status
18. F29-10: Unspecified psychosis not due to a substance or known physiological condition
19. Z880-10: Allergy status to penicillin
20. Z818-10: Family history of other mental and behavioral disorders
21. 81600-9: Closed fracture of phalanx or phalanges of hand, unspecified
22. 87341-9: Open wound of cheek, without mention of complication
23. H9222-10: Otorrhagia, left ear
24. Z978-10: Presence of other specified devices
25. G20-10: Parkinson's disease
26. G249-10: Dystonia, unspecified
27. 9100-9: Abrasion or friction burn of face, neck, and scalp except eye, without mention of infection
28. 78906-9: Abdominal pain, epigastric
29. E8889-9: Unspecified fall
30. 30500-9: Alcohol abuse, unspecified
31. G520-10: Disorders of olfactory nerve
32. 8020-9: Closed fracture of nasal bones
A.1. Learning through HRR Operations Efficiently

To make the HRR concept embeddings useful for a deep neural network, the operations used to form the embeddings need to be compatible with backpropagation, so that gradient descent can update the lower-level atomic vectors. We desired a function that produced the ICD concept embedding matrix C given the inputs of the VSA knowledge graphs 𝒢_i and the symbol embedding matrices R and A.

We attempted three approaches to computing C through VSA operations. First, we naively tried to compute each concept vector in C one at a time. However, this approach was too slow in both the forward and backward pass, requiring more than 1 second for each pass. Our second approach used slices of 𝒢 along the relationship dimension as a sparse binary matrix which, when multiplied with A, would perform the indexing and summing of atomic vectors for each concept. The result can be convolved with the relationship vector and added to the concept embedding matrix. This approach was much faster and used a moderate amount of memory for one of our less complex VSA formulations. However, when dealing with our most complex formulation, it used ~15 GB of memory.

Our final approach took advantage of the fact that many disease concepts use the same relationship, but to different atomic symbols. Also, the number of times a concept uses a particular relationship is relatively low, except for the SNOMED "isA" relationship and our defined "description" relationship. Thus, for a particular relationship, we can contribute to building many disease concept vectors at once by selecting many atomic vectors, doing a vectorized convolution with the relationship vector, and distributing the results to be added to the appropriate concept embedding rows. This step needs to be repeated at most m times for a particular relationship, where m is the maximum multiplicity of that relationship among all concepts. We improved memory efficiency by performing fast Fourier transforms (FFTs) on the atomic vector embeddings and constructing the concept vectors by performing binding via element-wise multiplication in the Fourier domain. Due to the linearity of the HRR operations, we performed a single final inverse FFT on the complex-valued concept embeddings to convert back to the real domain.

The final approach is much faster than the first approach, since it takes advantage of vectorized operations to contribute to many concept vectors at once. It is also more memory efficient than the second approach, since all the intermediate results are dense, so allocations are not wasted on creating mostly sparse results. On our most complex formulation, this approach uses ~3.5 GB of memory and takes ~80 ms and ~550 ms for the forward and backward pass, respectively.
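A condensed sketch of this final approach (ours; it follows the description above, assuming the edges of each 𝒢_i have been regrouped by relationship into index arrays, one group per pass):

```python
import torch

def build_concepts_fft(A, R, passes, N_c):
    """passes: list of (j, concept_idx, atom_idx) triples, where concept_idx and
    atom_idx are 1-D index tensors pairing concepts with atomic symbols for
    relationship j. Each relationship appears at most m times (its maximum
    multiplicity among all concepts)."""
    d = A.shape[1]
    A_f = torch.fft.rfft(A, dim=1)
    R_f = torch.fft.rfft(R, dim=1)
    C_f = torch.zeros(N_c, d // 2 + 1, dtype=A_f.dtype)
    for j, concept_idx, atom_idx in passes:
        # Vectorized binding: one spectral product covers many concepts at once
        bound = R_f[j].unsqueeze(0) * A_f[atom_idx]
        # Bundle (accumulate) into the matching concept rows; intermediate
        # results stay dense, which is what keeps memory use low
        C_f = C_f.index_add(0, concept_idx, bound)
    # Single inverse FFT at the end, valid by linearity of the HRR operations
    return torch.fft.irfft(C_f, n=d, dim=1)
```

Because `A` and `R` can be nn.Parameter tensors and every step (rfft, indexing, multiplication, index_add, irfft) is autograd-compatible, gradients from the transformer's loss reach the atomic vectors directly, which is the property the method relies on.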