<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM KDD Conference, August</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Representations for Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bing Hu</string-name>
          <email>bingxu.hu@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trevor Yu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tia Tuinstra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Rezai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshit Bokadia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachel DiMaio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Fortin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Vartian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bryan Tripp</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>McMaster University</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <issue>2024</issue>
      <abstract>
<p>Transformer models trained on NLP tasks with medical codes often have randomly initialized embeddings that are then adjusted based on training data. For terms appearing infrequently in the dataset, there is little opportunity to improve these representations and learn semantic similarity with other concepts. Medical ontologies represent many biomedical concepts and define a relationship structure between these concepts, making ontologies a valuable source of domain-specific information. Holographic Reduced Representations (HRR) are capable of encoding ontological structure by composing atomic vectors to create structured higher-level concept vectors. We developed an embedding layer that generates concept vectors for clinical diagnostic codes by applying HRR operations that compose atomic vectors based on the SNOMED CT ontology. This approach allows for learning the atomic vectors while maintaining structure in the concept vectors. We trained a Bidirectional Encoder Representations from Transformers (BERT) model to process sequences of clinical diagnostic codes and used the resulting HRR concept vectors as the embedding matrix for the model. The HRR-based approach introduced interpretable structure into code embeddings while maintaining or modestly improving performance on the masked language modeling (MLM) pre-training task (particularly for rare codes) as well as the fine-tuning tasks of mortality and disease prediction. This approach also better maintains semantic similarity between medically related concept vectors, due to both shared atomic vectors and disentangling of code-frequency information.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Ontology</kwd>
        <kwd>Knowledge-Integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Transformers [1] jointly optimize high-dimensional vector embeddings that represent input tokens, and a network that contextualizes and transforms these embeddings to perform a task. Originally designed for natural language processing (NLP) tasks, transformers are now widely used with other data modalities. In medical applications, one important modality consists of medical codes that are extensively used in electronic health records (EHR). A prominent example in this space is Med-BERT [2], which consumes a sequence of diagnosis codes. Tasks that Med-BERT and other EHR-transformers perform include disease and mortality prediction.</p>
      <p>Deep networks have traditionally been alternatives to symbolic artificial intelligence, with different advantages [<xref ref-type="bibr" rid="ref3">3</xref>]. Deep networks use real-world data effectively, but symbolic approaches have competitive properties, such as better transparency and capacity for incorporating structured information, inspiring many efforts to combine the two approaches in neuro-symbolic systems [<xref ref-type="bibr" rid="ref4">4</xref>]. This additional transparency and ability to incorporate structured information are potential benefits of symbolic approaches in medical applications [<xref ref-type="bibr" rid="ref5">5</xref>]. Standard large language models (LLMs) can be prone to biases in the training data, such as frequency bias, which can result in medical misinformation and potentially clinical harm [<xref ref-type="bibr" rid="ref6 ref7">6, 7, 8</xref>].</p>
      <p>Here we use a novel neuro-symbolic medical transformer architecture incorporating structured knowledge, which allows the architecture to optimize the embeddings of atomic concepts. We test our method, Holographic Reduced Representation Bi-directional Encoder Representations from Transformers (HRRBERT), on the Medical Information Mart for Intensive Care (MIMIC)-IV dataset [10] and show improvements in both pre-training and fine-tuning tasks. We also show that our embeddings of ontologically similar rare medical codes have high cosine similarity, in contrast with embeddings that are learned in the standard way. Finally, we investigate learned representations of medical-code frequency, in light of a recent demonstration of frequency bias in EHR-transformers [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>We contribute:</p>
      <p>• A novel neuro-symbolic architecture, HRRBERT, that combines vector-symbolic embeddings with the BERT LLM architecture, leading to better performance in medical tasks.</p>
      <p>• Efficient construction of vector-symbolic embeddings that leverages PyTorch autograd on GPUs.</p>
      <p>• Optimized medical-code embeddings that better respect the semantic similarity of medical terminology than standard embeddings for infrequently used codes.</p>
      <p>We focus here on processing medical codes, but our methods would extend naturally to foundation models that combine medical codes and natural language. Specifically, the trained atomic vectors of our vector-symbolic embeddings could share a dictionary with language embeddings, so that training of each could improve the representation of the other.</p>
      <sec id="sec-2-4">
        <title>1.1. Background and Related Works</title>
        <p>The Vector-Symbolic Architectures (VSA) approach is a computing paradigm that relies on high dimensionality and randomness to represent concepts as unique vectors in a high-dimensional space [11]. VSAs create and manipulate distributed representations of concepts by combining base vectors with bundling, binding, and permutation algebraic operators [12]. For example, a scene with a red box and a green ball could be described with the vector SCENE = RED ⊗ BOX + GREEN ⊗ BALL, where ⊗ indicates binding and + indicates bundling. The atomic concepts of RED, GREEN, BOX, and BALL are represented by base vectors, which are typically random. VSAs also define an inverse operation that allows the decomposition of a composite representation. For example, the scene representation could be queried as SCENE ⊗ BOX⁻¹. This should return the representation of RED, or an approximation of RED that is identifiable when compared to a dictionary. In a VSA, the similarity between concepts can be assessed by measuring the distance between the two corresponding vectors.</p>
        <p>VSAs were proposed to address challenges in modelling cognition, particularly language [12]. However, VSAs have been successfully applied across a variety of domains and modalities outside of the area of language as well, including in vision [13, 14], biosignal processing [15], and time-series classification [16]. Regardless of the modality or application, VSAs provide value by enriching vectors with additional information, such as spatial semantic information in images and global time encoding in time series.</p>
        <p>An early VSA framework was Smolensky’s Tensor Product Representation [17], which addressed the need for compositionality but suffered from exploding model dimensionality. The VSA framework introduced by Plate, Holographic Reduced Representations (HRR), improved upon Smolensky’s by using circular convolution as the binding operator [9]. Circular convolution keeps the output in the same dimension, solving the problem of exploding dimensionality.</p>
        <p>In the field of deep learning, HRRs have been used in previous work to recast self-attention for transformer models [18], to improve the efficiency of neural networks performing a multi-label classification task by using an HRR-based output layer [<xref ref-type="bibr" rid="ref3">3</xref>], and as a learning model itself with a dynamic encoder that is updated through training [19]. In all of these works, the efficiency and simple arithmetic of HRRs are leveraged. Our work differs in that we also leverage the ability of HRRs to create structured vectors that represent complex concepts as inputs to a transformer model.</p>
        <p>VSAs such as HRRs can effectively encode domain knowledge, including complex concepts and the relationships between them. For instance, Nickel et al. [20] propose holographic embeddings that make use of VSA properties to learn and represent knowledge graphs. Encoding domain knowledge is of interest in the field of deep learning, as it could improve, for example, a deep neural network’s ability to leverage human knowledge and to communicate its results within a framework that humans understand [21]. Ontologies are a form of domain knowledge incorporated into machine learning models to use background knowledge to create embeddings with meaningful similarity metrics, and for other purposes [22]. In our work, we use HRRs to encode domain knowledge in trainable embeddings for a transformer model. The domain knowledge we use comes from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), which is a widely used clinical ontology system that includes definitions of relationships between clinical concepts [23].</p>
        <p>To the best of our knowledge, HRRs have not been used before as embeddings for transformer models. Transformer models typically use learned embeddings with random initializations [<xref ref-type="bibr" rid="ref1">1</xref>]. However, in the context of representing ontological concepts, using such unstructured embeddings can have undesirable effects. One problem is the inconsistency between the rate of co-occurrence, or patterns of occurrence, of medical concepts and their degree of semantic similarity described by the ontology. For example, the concepts of “Type I Diabetes” and “Type II Diabetes” are mutually exclusive in EHR data and do not follow the same patterns of occurrence, due to differences in pathology and patient populations [24]. The differences in occurrence make it difficult for a transformer model to learn embeddings with accurate similarity metrics. The concepts should have relatively high similarity according to the ontology: they share a common ancestor of “Diabetes Mellitus,” they are both metabolic disorders that affect blood glucose levels, and they can both lead to similar health outcomes. Song et al. [24] seek to address this type of inconsistency by training multiple “multi-sense” embeddings for each non-leaf node in an ontology’s knowledge graph via an attention mechanism. However, the “multi-sense” embeddings do not address the learned frequency-related bias that also arises from the co-occurrence of concepts. Frequency-related bias raises an explainability issue, as it leads to learned embeddings that do not reflect true similarity relationships between concepts (for example, as defined in an ontology) but instead reflect the frequency of the concepts in the dataset [<xref ref-type="bibr" rid="ref6">6</xref>]. This bias particularly affects codes that are used less frequently.</p>
        <p>Our proposed approach, HRRBERT, uses the structure from SNOMED CT to represent thousands of concepts with high-dimensional vectors, such that each vector reflects a particular clinical meaning and can be compared to other vectors using the HRR similarity metric, cosine similarity. It also leverages the computing properties of HRRs to provide structured embeddings for an LLM that supports optimization through backpropagation.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Methods</title>
      <sec id="sec-3-1">
        <title>2.1. MIMIC-IV Dataset</title>
        <sec id="sec-3-1-1">
          <p>The data used in this study was derived from the Medical Information Mart for Intensive Care (MIMIC) v2.0
database, which is composed of de-identified EHRs from
in-patient hospital visits between 2008 and 2019 [10].
MIMIC-IV is available through PhysioNet [25]. We used
the ICD-9 and ICD-10 diagnostic codes from the
icd_diagnosis table from the MIMIC-IV hosp module. We
filtered out patients who did not have at least one diagnostic
code associated with their records. Sequences of codes
were generated per patient by sorting their hospital visits
by time. Within one visit, the order of codes from the
MIMIC-IV database was used, since it represents the
relative importance of the code for that visit. Each unique
code was assigned a token. In total, there were 189,980
patient records in the dataset. We used 174,890 patient
records for pre-training, on which we performed a 90–10
training-validation split. We reserved 15k records for
fine-tuning tasks.</p>
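<p>The per-patient sequence construction described above can be sketched as follows. The record structure and the codes are hypothetical stand-ins; only the logic (sort visits by time, keep within-visit order, assign one token per unique code) follows the text.</p>
<preformat>
```python
# toy records: patient id mapped to visits of (time, [codes]); hypothetical data
records = {
    "p1": [(2, ["I10"]), (1, ["E11.9", "I10"])],
}

def code_sequence(visits):
    # sort visits by time; keep the within-visit order, since it
    # encodes the relative importance of each code for that visit
    ordered = sorted(visits, key=lambda v: v[0])
    return [code for _, codes in ordered for code in codes]

vocab = {}

def tokenize(seq):
    # assign each unique code a token id on first sight
    return [vocab.setdefault(c, len(vocab)) for c in seq]

tokens = tokenize(code_sequence(records["p1"]))
```
</preformat>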
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Model Architecture</title>
        <sec id="sec-3-2-1">
          <p>We utilized a BERT-base model architecture with a post-layer-norm position and a sequence length of 128 ICD codes [<xref ref-type="bibr" rid="ref11">26</xref>]. A custom embedding class was used to support the functionality required for our HRR embeddings. We adapted the BERT segment embeddings to represent groups of codes from the same hospital visit, using up to 100 segment embeddings to encode visit sequencing. An embedding dimension of d = 768 was used, and all embeddings were initialized from N(0, 0.02), as in [<xref ref-type="bibr" rid="ref11">26</xref>], including the atomic vectors for HRR embeddings. Fine-tuning used a constant learning rate schedule with a weight decay of 4e-6. Fine-tuning lasted 10 epochs with a batch size of 80.</p>
        </sec>
        <sec id="sec-3-2-1b">
          <title>2.3. Encoding SNOMED Ontology with HRR Embeddings</title>
          <p>In this section, we detail the methodologies of constructing vector embeddings for ICD disease codes using HRR operations based on the SNOMED CT structured clinical vocabulary. We first describe our mapping from ICD concepts to SNOMED CT terms. Next, we define how the atomic symbols present in the SNOMED CT ontology are combined using HRR operations to construct concept vectors for the ICD codes. Finally, we describe our method to efficiently compute the HRR embedding matrix using default PyTorch operations that are compatible with autograd.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>2.3.1. Mapping ICD to SNOMED CT Ontology</title>
          <p>
            Our data uses ICD-9 and ICD-10 disease codes, while our symbolic ontology is defined in SNOMED CT, so we required a mapping from the ICD to the SNOMED CT
system to build our symbolic architecture. We used the
SNOMED CT International Release from May 31, 2022
[23] and only included SNOMED CT terms that were
active at the time of that release. While SNOMED
publishes a mapping tool from SNOMED CT to ICD-10, a
majority of ICD-10 concepts have one-to-many mappings
in the ICD-to-SNOMED CT direction [
            <xref ref-type="bibr" rid="ref12">27</xref>
            ]. To increase
the fraction of one-to-one mappings, we used additional
published mappings from the Observational Medical
Outcomes Partnership (OMOP) [
            <xref ref-type="bibr" rid="ref13">28</xref>
            ], mappings from ICD-9
directly to SNOMED CT [
            <xref ref-type="bibr" rid="ref14 ref8">29</xref>
            ], and mappings from ICD-10
to ICD-9 [
            <xref ref-type="bibr" rid="ref15 ref9">30</xref>
            ].
          </p>
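<p>The cascaded mapping strategy (direct ICD-10 to SNOMED CT, then ICD-9 to SNOMED CT, then ICD-10 through ICD-9 to SNOMED CT) can be sketched as below. The mapping tables and codes here are hypothetical stand-ins; the real tables come from the SNOMED CT, OMOP, and ICD-10-to-ICD-9 releases cited above.</p>
<preformat>
```python
# hypothetical stand-in mapping tables (real ones come from the cited releases)
icd10_to_snomed = {"E11": "44054006"}
icd9_to_snomed = {"250.00": "44054006"}
icd10_to_icd9 = {"E11.9": "250.00"}

def map_icd_to_snomed(code):
    # try the direct maps first, then fall back through ICD-9
    if code in icd10_to_snomed:
        return icd10_to_snomed[code]
    if code in icd9_to_snomed:
        return icd9_to_snomed[code]
    icd9 = icd10_to_icd9.get(code)
    if icd9 is not None:
        return icd9_to_snomed.get(icd9)
    return None  # no active mapping: the ICD code is dropped
```
</preformat>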
          <p>Notably, after excluding ICD codes with no active
SNOMED CT mapping, 671 out of the 26,164 unique
ICD codes in the MIMIC-IV dataset were missing
mappings. When those individual codes were removed, a
data volume of 4.62% of codes was lost. This removed 58
out of 190,180 patients from the dataset, as they had no
valid ICD codes in their history. Overall, the remaining
25,493 ICD codes mapped to a total of 12,263 SNOMED
CT terms.</p>
        </sec>
        <sec id="sec-3-2-2b">
          <title>2.3.2. SNOMED CT Vector Symbolic Architecture</title>
          <p>Next, we define how the contents of the SNOMED CT ontology were used to construct a symbolic graph to represent ICD concepts. For a given SNOMED CT term, we used its descriptive words and its relationships to other SNOMED CT terms. A relationship is defined by a relationship type and a target term. In total, there were 13,852 SNOMED CT target terms and 40 SNOMED CT relationship types used to represent the ICD concepts. In the ontology, many ICD concepts share SNOMED CT terms in their representations.</p>
        </sec>
        <sec id="sec-3-2-4">
          <p>The set of relationships was not necessarily unique for each SNOMED CT term. To add more unique
information, we used a term’s “fully specified name” and
any “synonyms” as an additional set of words describing
that term. We set all text to lowercase, stripped
punctuation, and split on spaces to create a vocabulary of words.</p>
        </sec>
        <sec id="sec-3-2-5">
          <p>We removed common English stopwords using a custom stopword list that was compiled with assistance from a medical physician. The procedure resulted in a total of 8,833 vocabulary words.</p>
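<p>The text-normalization steps (lowercasing, punctuation stripping, splitting on spaces, stopword removal) can be sketched as follows. The stopword set here is a tiny stand-in for the custom clinician-assisted list, and the example term description is illustrative.</p>
<preformat>
```python
import re

STOPWORDS = {"of", "the", "and", "with"}  # stand-in for the custom list

def description_vocab(texts):
    vocab = set()
    for text in texts:
        # lowercase, strip punctuation, split on whitespace
        cleaned = re.sub(r"[^\w\s]", " ", text.lower())
        vocab.update(w for w in cleaned.split() if w not in STOPWORDS)
    return sorted(vocab)

words = description_vocab(["Diabetes mellitus, type 2 (disorder)"])
```
</preformat>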
        </sec>
        <sec id="sec-3-2-6">
          <p>Overall, there were a total of 22,725 “atomic” symbols for the VSA, which included the SNOMED CT terms,
relationships, and the description vocabulary. Each symbol
was assigned an “atomic vector”. We built a “concept
vector” for each of the target 25,493 ICD codes using HRR
operations to combine atomic vectors according to the SNOMED CT ontology structure.</p>
        </sec>
        <sec id="sec-3-2-8">
          <p>To build a d-dimensional concept vector for a given ICD concept, we first considered the set of all relationships that the concept maps to. We used the HRR operator for binding, circular convolution (⊛), to combine vectors representing the relationship type and destination term, and defined the concept vector to be the bundling of these bound relationships. For the description words, we bundled the vectors representing each word together and bound this result with a new vector representing the relationship type “description,” as shown in Equation 1:</p>
          <p>v<sub>ICD concept</sub> = ∑<sub>SNOMED CT</sub> v<sub>rel</sub> ⊛ v<sub>term</sub> + v<sub>desc</sub> ⊛ ∑<sub>words</sub> v<sub>word</sub>  (1)</p>
          <p>Formally, let 𝒜 = {1, 2, …, N<sub>𝒜</sub>} be the set of integers enumerating the unique atomic symbols for SNOMED CT terms and description words. Let ℛ = {1, 2, …, N<sub>ℛ</sub>} be the set of integers enumerating the unique relationships for SNOMED CT terms, including the description relationship and the binding identity. Let 𝒞 = {1, 2, …, N<sub>𝒞</sub>} be the set of integers enumerating the ICD-9 and ICD-10 disease concepts represented by the VSA.</p>
          <p>Each atomic symbol has an associated embedding matrix A ∈ ℝ<sup>N<sub>𝒜</sub>×d</sup>, where the atomic vector a<sub>i</sub> = A<sub>[i,:]</sub>, i ∈ 𝒜, is the i-th row of the embedding matrix. Similarly, there is a relationship embedding matrix R ∈ ℝ<sup>N<sub>ℛ</sub>×d</sup> with r<sub>j</sub> = R<sub>[j,:]</sub>, j ∈ ℛ, and an ICD concept embedding matrix C ∈ ℝ<sup>N<sub>𝒞</sub>×d</sup> with c<sub>k</sub> = C<sub>[k,:]</sub>, k ∈ 𝒞. We describe the VSA with the formula in Equation 2, where G<sub>k</sub> is a graph representing the connections of ICD concept k to atomic symbols i by relationships j:</p>
          <p>c<sub>k</sub> = ∑<sub>(j,i) ∈ G<sub>k</sub></sub> r<sub>j</sub> ⊛ a<sub>i</sub>  (2)</p>
        </sec>
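<p>Equation 2 can be implemented with batched FFT operations so that gradients flow back into the atomic and relationship matrices, in line with the autograd-compatible construction described in this section. The sketch below is illustrative only: the toy sizes and the edge list are invented, not taken from the SNOMED CT graph.</p>
<preformat>
```python
import torch

d = 64                                  # embedding dimension (toy size)
n_atoms, n_rels, n_concepts = 10, 4, 3
A = torch.nn.Parameter(0.02 * torch.randn(n_atoms, d))  # atomic vectors
R = torch.nn.Parameter(0.02 * torch.randn(n_rels, d))   # relationship vectors

# edges (k, j, i): concept k is connected to atom i by relationship j
edges = torch.tensor([[0, 0, 1], [0, 1, 2], [1, 0, 1], [2, 2, 3]])

def build_concepts(A, R, edges, n_concepts, d):
    # bind each (relationship, atom) pair by circular convolution via FFT
    fr = torch.fft.rfft(R[edges[:, 1]], dim=-1)
    fa = torch.fft.rfft(A[edges[:, 2]], dim=-1)
    bound = torch.fft.irfft(fr * fa, n=d, dim=-1)
    # bundle by summation into each concept row (Equation 2)
    return torch.zeros(n_concepts, d).index_add(0, edges[:, 0], bound)

C = build_concepts(A, R, edges, n_concepts, d)
C.sum().backward()  # gradients reach the atomic vectors through the HRR ops
```
</preformat>
<p>Because binding and bundling are expressed entirely in differentiable tensor operations, the same concept matrix can be rebuilt each step while the atomic vectors are trained.</p>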
        <sec id="sec-3-2-10">
          <p>Additional details on how to efficiently use PyTorch autograd to learn through these HRR operations are provided in the appendix.</p>
        </sec>
        <sec id="sec-3-2-11">
          <title>2.3.3. Embedding Configurations</title>
          <p>We call our method of constructing embeddings for ICD codes purely from HRR representations “HRRBase”, and the standard method of creating transformer token embeddings from random vectors “unstructured”. While the HRRBase configuration enforces the ontology structure, we wondered whether it would be too rigid and have difficulty representing information not present in SNOMED CT. As dataset frequency information for ICD medical codes is not present in the HRR structure, we tried adding an embedding that represents the empirical frequency of each ICD code in the dataset. We also tried adding fully learnable embeddings with no prior structure.</p>
        </sec>
        <sec id="sec-3-2-14">
          <p>
            Given the wide range of ICD code frequencies in MIMIC, we log-transformed the empirical ICD code
frequencies, and then discretized the resulting range. For
our HRRFreq configuration, we used the sinusoidal
frequency encoding as in [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] to encode the discretized
log-frequency information. The frequency embeddings were
normalized before being summed with the HRR
embedding vectors.
          </p>
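<p>A sketch of the HRRFreq frequency encoding: log-transform the counts, discretize them into integer bins, look up a sinusoidal encoding per bin, and normalize before summing with the HRR concept vector. The counts and the bin width (one bin per unit of natural log) are invented for illustration.</p>
<preformat>
```python
import torch

def sinusoidal(pos, d):
    # standard transformer sinusoidal encoding for an integer bin index
    i = torch.arange(d // 2, dtype=torch.float32)
    angles = pos / torch.pow(torch.tensor(10000.0), 2 * i / d)
    enc = torch.zeros(d)
    enc[0::2] = torch.sin(angles)
    enc[1::2] = torch.cos(angles)
    return enc

d = 768
counts = torch.tensor([50000.0, 120.0, 3.0])  # hypothetical code frequencies
bins = torch.log(counts).round().long()       # discretized log-frequency
freq = torch.stack([sinusoidal(int(b), d) for b in bins])
freq = torch.nn.functional.normalize(freq, dim=-1)  # unit norm before summing
# hrr_freq_embeddings = concept_vectors + freq
```
</preformat>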
        </sec>
        <sec id="sec-3-2-15">
          <p>We defined two additional configurations in which a standard embedding vector was integrated with the structured HRR concept vector. With “HRRAdd”, a learnable embedding was added to the concept embedding: C<sub>HRRAdd</sub> = C + E<sub>add</sub>, where E<sub>add</sub> ∈ ℝ<sup>N<sub>𝒞</sub>×d</sup>. However, this roughly doubled the number of learnable parameters compared to other formulations.</p>
          <p>With “HRRCat”, a learnable embedding of dimension d/2 was concatenated with the HRR concept embedding of dimension d/2. This keeps the total number of learnable parameters roughly the same as in the unstructured configuration (25,493 d-dimensional vectors) and the HRRBase configuration (22,725 d-dimensional vectors). The final embedding matrix was defined as C<sub>HRRCat</sub> = [C E<sub>cat</sub>], where C, E<sub>cat</sub> ∈ ℝ<sup>N<sub>𝒞</sub>×d/2</sup> and the brackets denote concatenation along the embedding dimension.</p>
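<p>The two hybrid configurations reduce to one line each. The sizes below are illustrative toy values (the paper uses 25,493 codes at d = 768), and C stands in for the HRR-derived concept matrix.</p>
<preformat>
```python
import torch

n, d = 100, 768                  # toy sizes; the paper uses 25,493 x 768
C = torch.randn(n, d)            # stand-in for HRR concept embeddings

# HRRAdd: learnable offset added to each concept vector (~2x parameters)
E_add = torch.nn.Parameter(0.02 * torch.randn(n, d))
hrr_add = C + E_add

# HRRCat: d/2 structured half concatenated with a d/2 learnable half
C_half = torch.randn(n, d // 2)  # HRR part built at dimension d/2
E_cat = torch.nn.Parameter(0.02 * torch.randn(n, d // 2))
hrr_cat = torch.cat([C_half, E_cat], dim=-1)
```
</preformat>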
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>2.4. Experiments</title>
        <sec id="sec-3-3-1">
          <p>We pre-trained the unstructured, HRRBase, HRRCat, and HRRAdd embedding configurations of HRRBERT on the masked language modelling (MLM) task, for 3 trials each. For each of the 3 pre-trained models, 10 fine-tuning trials were conducted, for a total of 30 trials per fine-tuning task.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>The best checkpoint from the 10 epochs of fine-tuning</title>
          <p>was saved based on validation performance. A test set
containing 666 patient records was used to evaluate each
of the fine-tuned models for both mortality and disease
prediction. We report accuracy, precision, recall, and
F1 scores averaged over the 30 trials for the fine-tuning
tasks.
3. Experimental Results
3.1. Pre-training
date. A training set of 13k patient records along with a
validation set of 2k patient records were used to fine-tune
each model on mortality prediction. Table 1 shows the
evaluation results of mortality prediction for each of the
configurations. We performed a two-sided Dunnett’s test
to compare our multiple experimental HRR embedding
configurations to the control unstructured embeddings,
with  &lt; 0.05 significance level. HRRBase embeddings
had a significantly greater mean F1-score ( = 0.043)
and precision ( = 0.042) compared to unstructured
embeddings.
3.2.2. Disease Prediction Task</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>The disease prediction task is defined as predicting which</title>
          <p>Figure 1: Pre-training validation set evaluation results for disease chapters were recorded in the patient’s last visit
diferent configurations using information from earlier visits. We converted all
ICD codes in a patient’s last visit into a multi-label
bi</p>
          <p>MLM accuracy is evaluated on a validation set over the nary vector of disease chapters. As there are 22 disease
course of pre-training. Pre-training results for diferent chapters defined in ICD-10, the multi-label binary vector
configurations are shown in Figure 1. The pre-training has a size of 22 with binary values corresponding to the
results are averaged over 3 runs for each of the configu- presence of a disease in each chapter. A training set of
rations except for HRRFreq where only 1 model run was 4.5k patient records along with a validation set of 500
completed. patient records were used to fine-tune each model on</p>
          <p>The baseline of learned unstructured embeddings has this task. Table 1 shows the evaluation results of disease
a peak pre-training validation performance of around prediction for each of the configurations. For the
two33.4%. HRRBase embeddings perform around 17% worse sided Dunnett test, Levene’s test shows that the equal
compared to the baseline of learned unstructured embed- variance condition is satisfied, and the Shapiro-Wilk test
dings. We hypothesize that this decrease in performance suggests normal distributions except for HRRAdd
accuis due to a lack of embedded frequency information in racy. The test showed HRRBase embeddings had a
signifHRRBase compared to learned unstructured embeddings. icantly greater mean accuracy ( = 0.033) and precision
HRRFreq (which combines SNOMED CT information ( = 0.023) compared to unstructured embeddings. No
with frequency information) has a similar performance other comparisons of mean metrics for HRR embeddings
compared to unstructured embeddings, supporting this were significantly greater than the control.
hypothesis. Compared to baseline, HRRAdd and HRRCat
improve pre-training performance by a modest margin of 3.2.3. eICU Mortality Prediction
around 2%. We posit that this almost 20% increase in
performance of HRRCat and HRRAdd over HRRBase during
pre-training is partly due to the fully learnable
embedding used in HRRCat and HRRAdd learning frequency
information.</p>
          <p>
            An additional experiment conducted on the Philips
Electronic Intensive Care Unit (eICU) [
            <xref ref-type="bibr" rid="ref10 ref16">31</xref>
            ] shows
corroborating results with the MIMIC-IV experiments. For our
experiment, we applied our mortality prediction models
that were fine-tuned on MIMIC-IV to eICU data to see
if our results generalize. Table 1 shows that HRRBase
embeddings had a significantly greater mean accuracy
( = 0.046 ) compared to unstructured embeddings when
applied to the eICU dataset. These models are not
optimized for mortality prediction for other hospitals where
coding methodology and clinical practice may difer. For
example, the most common code in the eICU dataset
represents acute respiratory failure, whereas the most
common code in the MIMIC-IV dataset represents
hypertension.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.2. Fine-tuning</title>
        <sec id="sec-3-4-1">
          <title>We fine tuned the networks for mortality prediction and disease prediction. Across metrics and tasks, the best results were often seen in HRRBase (Table 1) with some being statistically significant.</title>
          <p>3.2.1. Mortality Prediction Task</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>The mortality prediction task is defined as predicting patient mortality within 6 months after the last visit. Binary mortality labels were generated by comparing the time diference between the last visit and the mortality</title>
          <p>We conducted an additional disease-prediction
experiment to test generalization to patients with codes outside
the training distribution. We found six patients with
records that consisted of only 32 codes between them
(see list of codes in Appendix A). We created a
really-out-of-distribution (ROOD) dataset that consisted of all
patients in MIMIC-IV (nearly 30K) with at least one of
these codes. We used this as a validation set. The
separate pre-training and fine-tuning dataset did not contain
these codes. We also created a smaller validation dataset
consisting of the six patients with only these codes.
During pre-training, the HRRBase and unstructured models
did not encounter any examples using the 32 ROOD codes
and so did not explicitly learn representations for those
codes. The trained models were then tested using the
ROOD dataset.</p>
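<p>The ROOD split described above amounts to simple set operations over patient code sets; the records and the three codes below are toy stand-ins, not MIMIC-IV data:</p>

```python
# Toy stand-ins: each patient maps to the set of ICD codes in their record.
ROOD_CODES = {"G248-10", "9916-9", "R636-10"}  # stand-in for the 32 codes
records = {
    "p1": {"G248-10", "K219-10"},  # has a ROOD code -> ROOD validation set
    "p2": {"K219-10", "2724-9"},   # no ROOD codes  -> pre-training/fine-tuning
    "p3": {"9916-9", "R636-10"},   # only ROOD codes -> small validation set
}

# Patients with at least one ROOD code form the ROOD validation set.
rood_val = {pid for pid, codes in records.items() if codes & ROOD_CODES}

# The training pool is everyone else, so it never contains the ROOD codes.
train_pool = set(records) - rood_val

# Patients whose records consist only of ROOD codes form the smaller set.
rood_only = {pid for pid, codes in records.items() if codes <= ROOD_CODES}
```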
          <p>Results from Table 1 on ROOD dataset disease
prediction show that HRRBase outperforms the unstructured
embedding model for contexts of entirely unseen codes.
We assess statistical significance using a two-tailed,
independent t-test with unequal variance, as some
measurements failed Levene's test for equal variance. The
means of all the metrics for HRRBase are significantly
greater than for unstructured when making inferences
on patients with entirely unseen codes, with p &lt; 0.001 for all
metrics. Given the embedded ontological structure, we
hypothesize that HRRBase implicitly learns useful
embeddings for the 32 unseen ROOD codes by learning the
shared embedding components of the VSA during training.</p>
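<p>In scipy terms, this testing procedure looks roughly as follows; the per-seed score arrays are illustrative stand-ins, not the reported measurements:</p>

```python
import numpy as np
from scipy import stats

# Illustrative per-seed metric scores for two embedding types (stand-ins).
hrr_scores = np.array([0.81, 0.83, 0.82])
unstructured_scores = np.array([0.70, 0.74, 0.66])

# Levene's test checks the equal-variance assumption first.
_, p_levene = stats.levene(hrr_scores, unstructured_scores)

# Welch's t-test (equal_var=False) does not assume equal variances;
# scipy's ttest_ind is two-tailed by default.
t, p_value = stats.ttest_ind(hrr_scores, unstructured_scores, equal_var=False)
```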
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.3. t-SNE of Frequency Bias</title>
        <sec id="sec-3-5-1">
          <p>
            We computed t-SNE dimension reductions to visualize relationships among ICD code embeddings in the
pre-trained models. Figure 2 shows that unstructured
embeddings of common ICD codes are clustered together
with a large separation from those of uncommon codes.
This suggests that code-frequency information is
prominently represented in these embeddings, consistent with
frequency bias in related models [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Common and
uncommon code clusters are less distinct in HRRBase, which
does not explicitly encode frequency information.
          </p>
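<p>A minimal sketch of such a visualization, with random matrices standing in for the pre-trained code embeddings and their log frequencies:</p>

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))   # stand-in for ICD code embeddings
log_freq = rng.uniform(-14, 0, size=200)  # stand-in for log code frequencies

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(embeddings)

# coords can then be scattered and colored by log_freq, e.g. with
# matplotlib: plt.scatter(coords[:, 0], coords[:, 1], c=log_freq).
```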
<p>As shown in Figure 1, adding code-frequency
information to the structured HRRBase embeddings, i.e. the
HRRFreq embeddings, improved the pre-training loss to be
similar to that of unstructured embeddings. This suggests that
unstructured components in HRRAdd and HRRCat may
have learned some frequency information, since these
losses are also similar to the loss of models with
Unstructured embeddings. To investigate whether this occurred,
we performed t-SNE dimension reductions of the
unstructured components of HRRAdd and HRRCat and colored
the points by code frequency, shown in Figure 3. This
graph suggests that these additional unstructured
embeddings learn some frequency information, due to
clustering of high-frequency codes. However, the frequency
information learned by the HRRCat and HRRAdd learnable
embeddings influences the overall embeddings less strongly
than in unstructured embeddings, as seen in
Figure 2, where low-frequency embeddings are less distinctly
separated from higher-frequency embeddings.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>3.4. Top-k Accuracy for MLM</title>
        <p>Accurately predicting infrequently used disease codes
is an important clinically relevant task. Given that the
model trains and sees more common codes compared
to rare codes, rare codes are naturally challenging to
predict. Through promising empirical results on
out-of-distribution mortality prediction for eICU and disease
prediction on ROOD, we hypothesized that our HRR
embedding models should have improved accuracy when
predicting rare codes in the dataset compared to
unstructured embedding models, since rare codes should share
some atomic vectors in their representations with
common codes.</p>
        <p>To test this, we evaluated the accuracy of an MLM pre-trained model predicting a single masked code of a known frequency. We split the codes in the pre-training validation dataset into 7 bins from log frequency -14 to 0, such that each bin has a width of 2. The most common codes are in the bin with log frequencies between -2 and 0, while the rarest codes are in the bin with log frequencies between -14 and -12. From each bin, we selected 400 codes at random, repeating codes from that bin if there were fewer than 400. For each of these codes, we selected one patient that had that code in their history, masked that code as would be done in MLM, and created a dataset of these 2,800 patients to use for MLM inference.</p>
        <p>Figure 4 and Figure 5, respectively, show the MLM top-10 and top-100 accuracy on predicting codes in the different frequency bins, averaged across the three pre-training models per configuration. [Figure 5 caption: The top-100 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. Significance levels of 0.05, 0.01, and 0.001 relative to unstructured embeddings are indicated with one, two, and three asterisks respectively.] Significant comparisons to the unstructured control at the p &lt; 0.05 level are indicated with an asterisk. We assess statistical significance for each bin using a two-tailed Dunnett's test comparing mean accuracy scores of the experimental HRR configurations against the control unstructured configuration. Notably, the top-100 accuracy in frequency bin -12 is non-zero for the HRR methods. The codes in this rarest bin occur only once in the dataset and therefore have never been used by the model for gradient updates, since they are in the validation dataset. This suggests that the HRR methods have some ability to provide clinically relevant information about rare codes. However, accuracy with the rarest codes remains too low to be of practical value, perhaps due to limited overlap of these codes' atomic vectors with those of more common codes.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.5. Medical Code Case Study</title>
        <p>Table 2 shows case studies for the codes Other and unspecified hyperlipidemia (2724-9), Hypothermia (9916-9), and Gastro-esophageal reflux disease without esophagitis (K219-10). In the first case study, for 2724-9, we observe that highly ontologically similar codes, such as Other hyperlipidemia and Hyperlipidemia, unspecified, are encoded with high cosine similarity for HRRBase, which is not the case for unstructured embeddings. The co-occurrence problem can be seen in the second case study, for 9916-9. The most similar codes for HRRBase are medically similar codes that would not usually co-occur, while for unstructured embeddings the most similar codes co-occur frequently. For the final case study, on K219-10, frequency-related bias can be observed in the unstructured embeddings, with frequent but mostly ontologically unrelated codes in the top list of cosine-similar codes, whereas the top list of cosine-similar codes for HRRBase contains medically similar codes.</p>
        <p>[Table 2 (excerpt): top cosine-similar codes. For Hypothermia (9916-9), the top Unstructured matches were Frostbite of hand (0.418), Frostbite of foot (0.361), Drowning and nonfatal submersion (0.352), and Immersion foot (0.341); the top HRRBase matches were Hypothermia, initial encounter; Hypothermia not with low env. temp.; Effect of reduced temp., initial encounter; and Other specified effects of reduced temp. For Gastro-esophageal reflux disease without esophagitis (K219-10), the top Unstructured matches were Esophageal reflux (0.565), Hyperlipidemia, unspecified (0.335), Anxiety disorder, unspecified (0.332), and Essential (primary) hypertension (0.326); the top HRRBase matches were Esophageal reflux; Gastro-eso. reflux d. with esophagitis; Reflux esophagitis; and Hypothyroidism, unspecified.]</p>
        <p>We broadened this case study to test statistical differences in cosine and semantic embedding similarity between structured and unstructured embeddings. 30 ICD codes were selected from different frequency categories in the dataset, with 10 codes drawn randomly from the 300 most common codes, 10 codes drawn randomly by weighted frequency from codes appearing fewer than 30 times in the dataset, and 10 codes randomly selected by weighted frequency from the entire dataset. For each selected code, the top 4 cosine-similar ICD codes were assessed by a physician for ontological similarity.</p>
        <p>For each frequency category, a one-tailed Fisher's exact test was conducted to determine whether a relationship existed between embedding type and clinical relatedness. We found that the results in the case of the rare codes were statistically significant, with p = 2.44×10<sup>-8</sup>. With 10 rare codes and the top 4 cosine-similar ICD codes selected for each rare code, there are 40 top cosine-similar codes in total. In the case of unstructured embeddings, only 4 of the top 40 cosine-similar codes were deemed to be strongly ontologically related by our physician, with the remaining codes deemed to be less related or unrelated. In the case of our structured HRRBase embeddings, 28 of the top 40 cosine-similar codes were deemed to be strongly ontologically related by our physician, with the remaining codes deemed to be less related or unrelated. This suggests that knowledge-integrated structured embeddings are associated with greater clinical relevance of the top cosine-similar codes than unstructured embeddings for rare codes where little training data exists.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Transformers have leading performance in many
applications, but their internal processes are opaque, emerging
from enormous parameter sets and data volumes beyond
human experience. It is hard to know when they can
be trusted. For example, generative transformers are
prone to subtle confabulations. Transformers have a
general-purpose architecture that performs as well in
vision and other modalities as in language. They are a
culmination of a key trend in artificial intelligence, away
from problem-specific engineering, and toward massive
data and computation. This trend is justified in terms of
performance. However, given two models with equal
performance, one with more explicit conceptual structure is
preferable in terms of trust and explainability.</p>
      <p>The work presented here is a step in this direction, with
our HRRBase embeddings that have explicit conceptual
structure and perform equivalently or better compared
to typical transformer embeddings. The benefit of
structured embeddings becomes more pronounced for tasks
that involve codes that are rare or are not present in
training data. HRR embeddings can also be relied on to
represent medical meaning rather than co-occurrence in
the training data. They also untangle the representation
of code frequency, so that it can be included or not, and
its effects on decisions understood. Importantly, despite
this additional structure, the embeddings are thoroughly
learned, suggesting that the approach will be consistent
with high performance beyond the examples we have
studied.</p>
      <p>As our method scales with and leverages PyTorch
autograd in the construction of the vector-symbolic
embeddings, it is compatible with existing medical LLM
architectures as an embedding component capable of
encoding domain knowledge.</p>
      <p>Future work could explore the potential of these
structured embeddings for explaining and controlling the
observed frequency bias. As HRRs can be queried with
linear operations, future work could also explore whether
transformers can learn to extract specific information
from these composite embeddings. Limitations to
address in future work include the complexity of processing
knowledge graphs to be compatible with HRRs. Another
important limitation is that our method relies on
rare-code HRRs sharing atomic elements with common-code
HRRs. However, in SNOMED CT, rare codes are likely to
contain some rare atomic elements. To address this point,
in addition to SNOMED CT, knowledge could be encoded
from sources such as pre-trained medical embeddings,
different medical ontologies, and other medical domain
knowledge to further improve our proposed
methodology. In LLMs that process both medical codes and text,
it would make sense to share word embeddings between
modalities. This would allow training of each modality
to benefit from training of the other, and may help to
align the representations of codes and text.</p>
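<p>The linear-query property of HRRs mentioned above can be illustrated with circular convolution binding and its approximate inverse, in the style of Plate's HRRs [9]; this is a sketch with random vectors, not the paper's implementation:</p>

```python
import numpy as np

def bind(a, b):
    # Circular convolution, computed in the Fourier domain.
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def unbind(c, a):
    # Approximate inverse of binding: convolve with the involution of a,
    # which is equivalent to circular correlation.
    inv_a = np.concatenate(([a[0]], a[:0:-1]))
    return bind(c, inv_a)

d = 1024
rng = np.random.default_rng(0)
role = rng.normal(0, 1 / np.sqrt(d), d)    # e.g. a relationship vector
filler = rng.normal(0, 1 / np.sqrt(d), d)  # e.g. an atomic symbol vector

trace = bind(role, filler)       # composite HRR embedding
recovered = unbind(trace, role)  # noisy copy of the filler

# A cleanup step would match `recovered` against the symbol vocabulary by
# cosine similarity; here we just check that it resembles the filler.
cos = recovered @ filler / (np.linalg.norm(recovered) * np.linalg.norm(filler))
```

Because both binding and unbinding are linear, these queries could in principle be applied to the composite embeddings a transformer consumes.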
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <p>We proposed a novel hybrid neural-symbolic approach called HRR-BERT that integrates medical ontologies
represented by HRR embeddings. In tests with the
MIMIC-IV dataset, HRR-BERT models modestly outperformed
baseline models with unstructured embeddings for
pre-training, disease prediction accuracy, mortality
prediction F1, and fine-tuning tasks involving infrequently seen
codes. HRR-BERT models had pronounced performance
advantages in MLM with rare codes and disease
prediction for patients with no codes seen during training
(ROOD - Unseen in Table 1). We also showed that HRRs
can be used to create medical code embeddings that
better respect ontological similarities for rare codes. A key
benefit of our approach is that it facilitates explainability
by disentangling token-frequency information, which
is prominently represented but implicit in unstructured
embeddings.
</p>
        <p>[7, cont.] training and scaling, in: International Conference on Machine Learning, PMLR, 2023, pp. 2397–2430.</p>
        <p>[8] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.</p>
        <p>[9] T. Plate, Holographic reduced representations, IEEE Transactions on Neural Networks 6 (1995) 623–641. doi:10.1109/72.377968.</p>
        <p>[10] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, R. Mark, MIMIC-IV (version 2.0), 2022. URL: https://doi.org/10.13026/7vcr-e114. doi:10.13026/7vcr-e114.</p>
        <p>[11] P. Kanerva, Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors, Cognitive Computation 1 (2009) 139–159. URL: https://api.semanticscholar.org/CorpusID:733980.</p>
        <p>[12] R. W. Gayler, Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience, 2004. arXiv:cs/0412059.</p>
        <p>[13] P. Neubert, S. Schubert, Hyperdimensional computing as a framework for systematic aggregation of image descriptors, 2021. URL: http://arxiv.org/abs/2101.07720. doi:10.48550/arXiv.2101.07720.</p>
        <p>[14] P. Neubert, S. Schubert, K. Schlegel, P. Protzel, Vector semantic representations as descriptors for visual place recognition, in: Robotics: Science and Systems XVII, Robotics: Science and Systems Foundation, 2021. URL: http://www.roboticsproceedings.org/rss17/p083.pdf. doi:10.15607/RSS.2021.XVII.083.</p>
        <p>[15] A. Rahimi, P. Kanerva, L. Benini, J. M. Rabaey, Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals, Proceedings of the IEEE 107 (2019) 123–143. doi:10.1109/JPROC.2018.2871163.</p>
        <p>[16] K. Schlegel, P. Neubert, P. Protzel, HDC-MiniROCKET: Explicit time encoding in time series classification with hyperdimensional computing, 2022. URL: http://arxiv.org/abs/2202.08055. doi:10.48550/arXiv.2202.08055.</p>
        <p>[17] P. Smolensky, Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159–216. URL: https://www.sciencedirect.com/science/article/pii/000437029090007M. doi:10.1016/0004-3702(90)90007-M.</p>
        <p>[18] M. M. Alam, E. Raff, S. Biderman, T. Oates, J. Holt, Recasting self-attention with holographic reduced representations, 2023. arXiv:2305.19534.</p>
        <p>[19] J. Kim, H. Lee, M. Imani, Y. Kim, Efficient hyperdimensional learning with trainable, quantizable, and holistic data representation, in: 2023 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2023, pp. 1–6. doi:10.23919/DATE56975.2023.10137134.</p>
        <p>[20] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs, 2015. URL: http://arxiv.org/abs/1510.04935. doi:10.48550/arXiv.1510.04935.</p>
        <p>[21] T. Dash, A. Srinivasan, L. Vig, Incorporating symbolic domain knowledge into graph neural networks, Machine Learning 110 (2021) 1609–1636. URL: https://doi.org/10.1007%2Fs10994-021-05966-z. doi:10.1007/s10994-021-05966-z.</p>
        <p>[22] M. Kulmanov, F. Z. Smaili, X. Gao, R. Hoehndorf, Semantic similarity and machine learning with ontologies, Briefings in Bioinformatics 22 (2020) bbaa199. URL: https://doi.org/10.1093/bib/bbaa199. doi:10.1093/bib/bbaa199.</p>
        <p>[23] V. Riikka, V. Anne, P. Sari, Systematized nomenclature of medicine-clinical terminology (SNOMED CT) clinical use cases in the context of electronic health record systems: Systematic literature review, JMIR Med Inform (2023). doi:10.2196/43750.</p>
        <p>[24] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. M. Fung, J. Poon, Medical concept embedding with multiple ontological representations, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 4613–4619. URL: https://doi.org/10.24963/ijcai.2019/641. doi:10.24963/ijcai.2019/641.</p>
        <p>[25] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, H. E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals, Circulation 101 (2000) e215–e220. doi:10.1161/01.CIR.101.23.e215.</p>
        <p>[<xref ref-type="bibr" rid="ref11">26</xref>] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
        <p>[<xref ref-type="bibr" rid="ref12">27</xref>] NLM, SNOMED CT to ICD-10-CM map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.</p>
        <p>[<xref ref-type="bibr" rid="ref13">28</xref>] OHDSI, OHDSI standardized vocabularies, 2019. URL: https://github.com/OHDSI/Vocabulary-v5.0/wiki.</p>
        <sec id="sec-5-1-1">
          <title>A.1. Learning through HRR Operations Efficiently</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. List of 32 ROOD Codes</title>
      <sec id="sec-6-1">
        <title>The following is the list of 32 ROOD codes:</title>
        <p>1. G248-10: Other dystonia
2. E8498-9: Accidents occurring in other specified places
3. E9688-9: Assault by other specified means
4. Z681-10: Body mass index (BMI) 19.9 or less, adult
5. 30550-9: Opioid abuse, unspecified
6. R262-10: Difficulty in walking, not elsewhere classified
7. E887-9: Fracture, cause unspecified
8. R471-10: Dysarthria and anarthria
9. 9916-9: Hypothermia
10. E9010-9: Accident due to excessive cold due to weather conditions
11. F10129-10: Alcohol abuse with intoxication, unspecified
12. E8499-9: Accidents occurring in unspecified place
13. R636-10: Underweight
14. 920-9: Contusion of face, scalp, and neck except eye(s)
15. R4182-10: Altered mental status, unspecified
16. 95901-9: Head injury, unspecified
17. 78097-9: Altered mental status
18. F29-10: Unspecified psychosis not due to a substance or known physiological condition
19. Z880-10: Allergy status to penicillin
20. Z818-10: Family history of other mental and behavioral disorders
21. 81600-9: Closed fracture of phalanx or phalanges of hand, unspecified
22. 87341-9: Open wound of cheek, without mention of complication
23. H9222-10: Otorrhagia, left ear
24. Z978-10: Presence of other specified devices
25. G20-10: Parkinson's disease</p>
        <p>To make the HRR concept embeddings useful for a deep neural network, the operations used to form the embeddings need to be compatible with backpropagation, so that gradient descent can update the lower-level atomic vectors. We desired a function that produced the ICD concept embedding matrix given the inputs of the VSA knowledge graphs and the symbol embedding matrices.</p>
        <p>We attempted three approaches to computing the concept embedding matrix through VSA operations. First, we naively tried to compute each concept vector one at a time. However, this approach was too slow in both the forward and backward pass, requiring more than 1 second for each pass.</p>
        <p>Our second approach used slices of the knowledge graph along the relationship dimension as a sparse binary matrix which, when multiplied with the symbol embedding matrix, would perform the indexing and summing of atomic vectors for each concept. This result can be convolved with the relationship vector and added to the concept embedding matrix. This approach was much faster and used a moderate amount of memory for one of our less complex VSA formulations. However, when dealing with our most complex formulation, it used ∼15 GB of memory.</p>
        <p>Our final approach took advantage of the fact that many disease concepts use the same relationship, but with different atomic symbols. Also, the number of times a concept uses a particular relationship is relatively low, except for the SNOMED "isA" relationship and our defined "description" relationship. Thus, for a particular relationship, we can contribute to building many disease concept vectors at once by selecting many atomic vectors, doing a vectorized convolution with the relationship vector, and distributing the results to be added to the appropriate concept embedding rows. This step needs to be repeated, for a particular relationship, at most as many times as the maximum multiplicity of that relationship among all concepts. We improved memory efficiency by performing fast Fourier transforms (FFTs) on the atomic vector embeddings and constructing the concept vectors by performing binding via element-wise multiplication in the Fourier domain. Due to the linearity of the HRR operations, we performed a single final inverse FFT on the complex-valued concept embedding to convert back to the real domain.</p>
        <p>The final approach is much faster than the first
approach since it takes advantage of vectorized operations
to contribute to many concept vectors at once. It is also
more memory eficient than the second approach since
all the intermediate results are dense, so allocations are
not wasted on creating mostly sparse results. On our
most complex formulation, this approach uses ∼3.5 GB
of memory, and takes ∼80 ms and ∼550 ms for forward
and backward pass respectively.</p>
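<p>A minimal NumPy sketch of the final approach, assuming random stand-ins for the atomic vectors and a single relationship; note that NumPy's <italic>np.add.at</italic> performs the scatter-add into concept rows and, unlike the repeated passes described above, handles repeated concept indices in one call:</p>

```python
import numpy as np

d = 256                                   # HRR dimensionality
rng = np.random.default_rng(1)

n_atoms, n_concepts = 50, 20
atoms = rng.normal(0, 1 / np.sqrt(d), (n_atoms, d))  # atomic symbol vectors
relation = rng.normal(0, 1 / np.sqrt(d), d)          # one relationship vector

# One (concept, atom) pair per edge of this relationship in the graph.
n_edges = 100
concept_idx = rng.integers(0, n_concepts, size=n_edges)
atom_idx = rng.integers(0, n_atoms, size=n_edges)

# Bind all selected atoms with the relationship vector in one vectorized
# pass: element-wise multiplication in the Fourier domain is circular
# convolution, and rfft broadcasts over the stacked atom vectors.
bound = np.fft.irfft(np.fft.rfft(atoms[atom_idx]) * np.fft.rfft(relation), n=d)

# Scatter-add each bound vector into its concept's embedding row;
# np.add.at accumulates correctly even when a concept index repeats.
concepts = np.zeros((n_concepts, d))
np.add.at(concepts, concept_idx, bound)
```

By the linearity of HRR superposition, this produces the same concept matrix as binding one edge at a time, while staying dense throughout.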
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rasmy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Med-bert: pretrained contextualized embeddings on largescale structured electronic health records for disease prediction</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>86</fpage>
          . doi:10.1038/s41746-021-00455-y.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gandhi</surname>
          </string-name>
          , E. Raf,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McLean</surname>
          </string-name>
          ,
          <article-title>Learning with holographic reduced representations</article-title>
          ,
          <source>CoRR abs/2109</source>
          .02157 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2109.02157. arXiv:
          2109.02157.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Sarker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eberhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          , Neurosymbolic artificial intelligence,
          <source>AI</source>
          Communications
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>197</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramgopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Sanchez-Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Horvat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Florin</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence-based clinical decision support in pediatrics</article-title>
          ,
          <source>Pediatric research 93</source>
          (
          <year>2023</year>
          )
          <fpage>334</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuinstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rezai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fortin</surname>
          </string-name>
          , R. DiMaio,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vartian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tripp</surname>
          </string-name>
          ,
          <article-title>Frequency bias in mlmtrained bert embeddings for medical codes</article-title>
          ,
          <source>CMBES Proceedings 45</source>
          (
          <year>2023</year>
          ). URL: https://proceedings. cmbes.ca/index.php/proceedings/article/view/1050.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. G.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hallahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Prashanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raff</surname>
          </string-name>
          , et al.,
          <article-title>Pythia: A suite for analyzing large language models across</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [29]
          <string-name>
            <surname>NLM</surname>
          </string-name>
          ,
          <article-title>ICD-9-CM diagnostic codes to SNOMED CT map</article-title>
          ,
          <year>2022</year>
          . URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [30]
          <string-name>
            <surname>NCHS</surname>
          </string-name>
          , Diagnosis code set general equivalence mappings,
          <year>2018</year>
          . URL: https://ftp.cdc.gov/pub/health_statistics/nchs/Publications/ICD10CM/2018/Dxgem_guide_2018.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E. W.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Rafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Badawi</surname>
          </string-name>
          ,
          <article-title>The eICU Collaborative Research Database, a freely available multi-center database for critical care research</article-title>
          ,
          <source>Scientific data 5</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          26. G249-10: Dystonia, unspecified
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          27. 9100-9: Abrasion or friction burn of face, neck, and scalp except eye, without mention of infection
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          28. 78906-9: Abdominal pain, epigastric
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          29. E8889-9: Unspecified fall
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          30. 30500-9: Alcohol abuse, unspecified
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          31. G520-10: Disorders of olfactory nerve
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          32. 8020-9: Closed fracture of nasal bones
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>