Encoding Medical Ontologies With Holographic Reduced Representations for Transformers

Bing Hu1,∗, Trevor Yu1, Tia Tuinstra1, Ryan Rezai1, Harshit Bokadia1, Rachel DiMaio1, Thomas Fortin1, Brian Vartian1,2 and Bryan Tripp1

1 University of Waterloo, Ontario, Canada
2 McMaster University, Ontario, Canada

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
∗ Corresponding author: bingxu.hu@uwaterloo.ca (B. Hu)

Abstract

Transformer models trained on NLP tasks with medical codes often have randomly initialized embeddings that are then adjusted based on training data. For terms appearing infrequently in the dataset, there is little opportunity to improve these representations and learn semantic similarity with other concepts. Medical ontologies represent many biomedical concepts and define a relationship structure between these concepts, making ontologies a valuable source of domain-specific information. Holographic Reduced Representations (HRR) are capable of encoding ontological structure by composing atomic vectors to create structured higher-level concept vectors. We developed an embedding layer that generates concept vectors for clinical diagnostic codes by applying HRR operations that compose atomic vectors based on the SNOMED CT ontology. This approach allows for learning the atomic vectors while maintaining structure in the concept vectors. We trained a Bidirectional Encoder Representations from Transformers (BERT) model to process sequences of clinical diagnostic codes and used the resulting HRR concept vectors as the embedding matrix for the model. The HRR-based approach introduced interpretable structure into code embeddings while maintaining or modestly improving performance on the masked language modeling (MLM) pre-training task (particularly for rare codes) as well as the fine-tuning tasks of mortality and disease prediction. This approach also better maintains semantic similarity between medically related concept vectors, due to both shared atomic vectors and disentangling of code-frequency information.

Keywords: Deep Learning, Ontology, Knowledge-Integration

1. Introduction

Transformers [1] jointly optimize high-dimensional vector embeddings that represent input tokens, and a network that contextualizes and transforms these embeddings to perform a task. Originally designed for natural language processing (NLP) tasks, transformers are now widely used with other data modalities. In medical applications, one important modality consists of medical codes that are used extensively in electronic health records (EHR). A prominent example in this space is Med-BERT [2], which consumes a sequence of diagnosis codes. Tasks that Med-BERT and other EHR-transformers perform include disease and mortality prediction.

Deep networks have traditionally been alternatives to symbolic artificial intelligence, with different advantages [3]. Deep networks use real-world data effectively, but symbolic approaches have complementary properties, such as better transparency and capacity for incorporating structured information, inspiring many efforts to combine the two approaches in neuro-symbolic systems [4]. Additional transparency and the ability to incorporate structured information are potential benefits of symbolic approaches in medical applications [5]. Standard large language models (LLMs) can be prone to biases in the training data, such as frequency bias, which can result in medical misinformation and potentially clinical harm [6, 7, 8].

Here we use a novel neuro-symbolic medical transformer architecture that incorporates structured knowledge from an authoritative medical ontology into the embeddings. Specifically, we use vector-symbolic holographic reduced representations (HRRs) [9] to produce composite medical-code embeddings and backpropagate through the architecture to optimize the embeddings of atomic concepts. This approach produces optimized medical-code embeddings with an explicit structure that incorporates medical knowledge.

We test our method, Holographic Reduced Representation Bidirectional Encoder Representations from Transformers (HRRBERT), on the Medical Information Mart for Intensive Care (MIMIC)-IV dataset [10] and show improvements in both pre-training and fine-tuning tasks. We also show that our embeddings of ontologically similar rare medical codes have high cosine similarity, in contrast with embeddings that are learned in the standard way. Finally, we investigate learned representations of medical-code frequency, in light of a recent demonstration of frequency bias in EHR-transformers [6].
We contribute:

• A novel neuro-symbolic architecture, HRRBERT, that combines vector-symbolic embeddings with the BERT LLM architecture, leading to better performance in medical tasks.
• Efficient construction of vector-symbolic embeddings that leverages PyTorch autograd on GPUs.
• Optimized medical-code embeddings that better respect the semantic similarity of medical terminology than standard embeddings for infrequently used codes.

We focus here on processing medical codes, but our methods would extend naturally to foundation models that combine medical codes and natural language. Specifically, the trained atomic vectors of our vector-symbolic embeddings could share a dictionary with language embeddings, so that training of each could improve the representation of the other.

1.1. Background and Related Works

The Vector-Symbolic Architectures (VSA) approach is a computing paradigm that relies on high dimensionality and randomness to represent concepts as unique vectors in a high-dimensional space [11]. VSAs create and manipulate distributed representations of concepts by combining base vectors with bundling, binding, and permutation algebraic operators [12]. For example, a scene with a red box and a green ball could be described with the vector SCENE = RED⊗BOX + GREEN⊗BALL, where ⊗ indicates binding and + indicates bundling. The atomic concepts of RED, GREEN, BOX, and BALL are represented by base vectors, which are typically random. VSAs also define an inverse operation that allows the decomposition of a composite representation. For example, the scene representation could be queried as SCENE⊗BOX⁻¹. This should return the representation of RED, or an approximation of RED that is identifiable when compared to a dictionary. In a VSA, the similarity between concepts can be assessed by measuring the distance between the two corresponding vectors.
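As a concrete illustration of these operators, the following sketch (ours, not from the paper) implements binding as circular convolution, as in the HRR scheme adopted below, and queries a composite scene vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

def rand_vec():
    # Base vectors are i.i.d. normal with variance 1/d, so norms are ~1
    return rng.normal(0.0, 1.0 / np.sqrt(d), d)

def bind(a, b):
    # HRR binding: circular convolution, computed in the Fourier domain
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def inverse(a):
    # Approximate inverse (involution): [a[0], a[d-1], a[d-2], ..., a[1]]
    return np.concatenate(([a[0]], a[:0:-1]))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

RED, GREEN, BOX, BALL = (rand_vec() for _ in range(4))
SCENE = bind(RED, BOX) + bind(GREEN, BALL)  # bundling is vector addition

query = bind(SCENE, inverse(BOX))  # SCENE ⊗ BOX^-1
print(cosine(query, RED))    # high (~0.7): RED is recovered, plus noise
print(cosine(query, GREEN))  # near zero
```

The query returns RED plus a noise term (GREEN⊗BALL bound with BOX⁻¹), which is why the result must be cleaned up against a dictionary of known vectors.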
VSAs were proposed to address challenges in modelling cognition, particularly language [12]. However, VSAs have also been successfully applied across a variety of domains and modalities outside of language, including vision [13, 14], biosignal processing [15], and time-series classification [16]. Regardless of the modality or application, VSAs provide value by enriching vectors with additional information, such as spatial semantic information in images and global time encoding in time series.

An early VSA framework was Smolensky's Tensor Product Representation [17], which addressed the need for compositionality but suffered from exploding model dimensionality. The VSA framework introduced by Plate, Holographic Reduced Representations (HRR), improved upon Smolensky's by using circular convolution as the binding operator [9]. Circular convolution keeps the output in the same dimension, solving the problem of exploding dimensionality.

In the field of deep learning, HRRs have been used in previous work to recast self-attention for transformer models [18], to improve the efficiency of neural networks performing a multi-label classification task by using an HRR-based output layer [3], and as a learning model itself with a dynamic encoder that is updated through training [19]. In all of these works, the efficiency and simple arithmetic of HRRs are leveraged. Our work differs in that we also leverage the ability of HRRs to create structured vectors that represent complex concepts as inputs to a transformer model.

VSAs such as HRRs can effectively encode domain knowledge, including complex concepts and the relationships between them. For instance, Nickel et al. [20] propose holographic embeddings that make use of VSA properties to learn and represent knowledge graphs. Encoding domain knowledge is of interest in the field of deep learning, as it could improve, for example, a deep neural network's ability to leverage human knowledge and to communicate its results within a framework that humans understand [21]. Ontologies are a form of domain knowledge incorporated into machine learning models to use background knowledge to create embeddings with meaningful similarity metrics, among other purposes [22]. In our work, we use HRRs to encode domain knowledge in trainable embeddings for a transformer model. The domain knowledge we use comes from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), a widely used clinical ontology that includes definitions of relationships between clinical concepts [23].

To the best of our knowledge, HRRs have not been used before as embeddings for transformer models. Transformer models typically use learned embeddings with random initializations [1]. However, in the context of representing ontological concepts, such unstructured embeddings can have undesirable effects. One problem is the inconsistency between the rate or patterns of co-occurrence of medical concepts and their degree of semantic similarity as described by the ontology. For example, the concepts of "Type I Diabetes" and "Type II Diabetes" are mutually exclusive in EHR data and do not follow the same patterns of occurrence, due to differences in pathology and patient populations [24]. These differences in occurrence make it difficult for a transformer model to learn embeddings with accurate similarity metrics. The concepts should have relatively high similarity according to the ontology: they share the common ancestor "Diabetes Mellitus," they are both metabolic disorders that affect blood glucose levels, and they can both lead to similar health outcomes.

Song et al. [24] seek to address this type of inconsistency by training multiple "multi-sense" embeddings for each non-leaf node in an ontology's knowledge graph via an attention mechanism. However, the "multi-sense" embeddings do not address the learned frequency-related bias that also arises from the co-occurrence of concepts. Frequency-related bias raises an explainability issue, as it leads to learned embeddings that do not reflect true similarity relationships between concepts (for example, as defined in an ontology) but instead reflect the frequency of the concepts in the dataset [6]. This bias particularly affects codes that are used less frequently.

Our proposed approach, HRRBERT, uses the structure from SNOMED CT to represent thousands of concepts with high-dimensional vectors such that each vector reflects a particular clinical meaning and can be compared to other vectors using the HRR similarity metric, cosine similarity. It also leverages the computing properties of HRRs to provide structured embeddings for an LLM that support optimization through backpropagation.
2. Methods

2.1. MIMIC-IV Dataset

The data used in this study were derived from the Medical Information Mart for Intensive Care (MIMIC)-IV v2.0 database, which is composed of de-identified EHRs from in-patient hospital visits between 2008 and 2019 [10]. MIMIC-IV is available through PhysioNet [25]. We used the ICD-9 and ICD-10 diagnostic codes from the icd_diagnosis table of the MIMIC-IV hosp module. We filtered out patients who did not have at least one diagnostic code associated with their records. Sequences of codes were generated per patient by sorting their hospital visits by time. Within one visit, the order of codes from the MIMIC-IV database was used, since it represents the relative importance of the code for that visit. Each unique code was assigned a token. In total, there were 189,980 patient records in the dataset. We used 174,890 patient records for pre-training, on which we performed a 90–10 training-validation split. We reserved 15k records for fine-tuning tasks.
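A minimal sketch of this preprocessing (ours, not from the paper), assuming the hosp-module tables have been exported to CSV; the column names follow the MIMIC-IV schema, but the exact file layout is an assumption:

```python
import pandas as pd

# diagnoses_icd: subject_id, hadm_id, seq_num, icd_code, icd_version
# admissions:    subject_id, hadm_id, admittime
dx = pd.read_csv("diagnoses_icd.csv")
adm = pd.read_csv("admissions.csv", parse_dates=["admittime"])

dx = dx.merge(adm[["hadm_id", "admittime"]], on="hadm_id")

# Visits in temporal order; within a visit, keep the recorded order (seq_num),
# which reflects the relative importance of each code for that visit
dx = dx.sort_values(["subject_id", "admittime", "seq_num"])

# One token per unique (code, version) pair, one sequence per patient
dx["token"] = dx["icd_code"] + "-" + dx["icd_version"].astype(str)
sequences = dx.groupby("subject_id")["token"].agg(list)
sequences = sequences[sequences.str.len() > 0]  # keep patients with >= 1 code
```

The code-version token format ("2724-9", "K219-10") matches the code identifiers used in the case studies later in the paper.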
2.2. Model Architecture

We utilized a BERT-base model architecture with post-layer normalization and a sequence length of 128 ICD codes [26]. A custom embedding class was used to support the functionality required for our HRR embeddings. We adapted the BERT segment embeddings to represent groups of codes from the same hospital visit, using up to 100 segment embeddings to encode visit sequencing. An embedding dimension of d = 768 was used, and all embeddings were initialized from x ~ N_d(0, 0.02), as in [26], including the atomic vectors for HRR embeddings. Fine-tuning used a constant learning rate schedule with a weight decay of 4e-6. Fine-tuning lasted 10 epochs with a batch size of 80.

2.3. Encoding SNOMED Ontology with HRR Embeddings

In this section, we detail the methodology for constructing vector embeddings for ICD disease codes using HRR operations based on the SNOMED CT structured clinical vocabulary. We first describe our mapping from ICD concepts to SNOMED CT terms. Next, we define how the atomic symbols present in the SNOMED CT ontology are combined using HRR operations to construct concept vectors for the ICD codes. Finally, we describe our method to efficiently compute the HRR embedding matrix using default PyTorch operations that are compatible with autograd.

2.3.1. Mapping ICD to SNOMED CT Ontology

Our data use ICD-9 and ICD-10 disease codes, while our symbolic ontology is defined in SNOMED CT, so we required a mapping from ICD to the SNOMED CT system to build our symbolic architecture. We used the SNOMED CT International Release from May 31, 2022 [23] and only included SNOMED CT terms that were active at the time of that release. While SNOMED publishes a mapping tool from SNOMED CT to ICD-10, a majority of ICD-10 concepts have one-to-many mappings in the ICD-to-SNOMED CT direction [27]. To increase the fraction of one-to-one mappings, we used additional published mappings from the Observational Medical Outcomes Partnership (OMOP) [28], mappings from ICD-9 directly to SNOMED CT [29], and mappings from ICD-10 to ICD-9 [30].

Notably, after excluding ICD codes with no active SNOMED CT mapping, 671 of the 26,164 unique ICD codes in the MIMIC-IV dataset were missing mappings. When those individual codes were removed, 4.62% of the data volume of codes was lost. This removed 58 of 190,180 patients from the dataset, as they had no valid ICD codes in their history. Overall, the remaining 25,493 ICD codes mapped to a total of 12,263 SNOMED CT terms.

2.3.2. SNOMED CT Vector-Symbolic Architecture

Next, we define how the contents of the SNOMED CT ontology were used to construct a symbolic graph to represent ICD concepts. For a given SNOMED CT term, we used its descriptive words and its relationships to other SNOMED CT terms. A relationship is defined by a relationship type and a target term. In total, 13,852 SNOMED CT target terms and 40 SNOMED CT relationship types were used to represent all desired ICD concepts. In the ontology, many ICD concepts share SNOMED CT terms in their representations.

The set of relationships was not necessarily unique for each SNOMED CT term. To add more unique information, we used a term's "fully specified name" and any "synonyms" as an additional set of words describing that term. We set all text to lowercase, stripped punctuation, and split on spaces to create a vocabulary of words. We removed common English stopwords using a custom stopword list that was collected with the assistance of a medical physician. The procedure resulted in a total of 8,833 vocabulary words.

Overall, there were a total of 22,725 "atomic" symbols for the VSA, which included the SNOMED CT terms, relationships, and the description vocabulary. Each symbol was assigned an "atomic vector". We built a "concept vector" for each of the target 25,493 ICD codes using HRR operations to combine atomic vectors according to the SNOMED CT ontology structure.

To build a d-dimensional concept vector for a given ICD concept, we first considered the set of all relationships that the concept maps to. We used the HRR operator for binding, circular convolution, to combine the vectors representing the relationship type and destination term, and defined the concept vector to be the bundling of these bound relationships. For the description words, we bundled the vectors representing each word together and bound the result with a new vector representing the relationship type "description," as shown in Equation 1.

    x_ICD concept = ∑_SNOMED CT (x_rel ⊛ x_term) + ∑_words (x_desc ⊛ x_word)    (1)

Formally, let 𝔸 = {1, 2, ..., N_a} be the set of integers enumerating the unique atomic symbols for SNOMED CT terms and description words. Let 𝔹 = {1, 2, ..., N_r} be the set of integers enumerating the unique relationships for SNOMED CT terms, including the description relationship and the binding identity. Let 𝔻 = {1, 2, ..., N_c} be the set of integers enumerating the ICD-9 and ICD-10 disease concepts represented by the VSA.

𝔸 has an associated embedding matrix A ∈ ℝ^(N_a×d), where atomic vector a_k = A[k,:], k ∈ 𝔸, is the k-th row of the embedding matrix. Similarly, there is a relationship embedding matrix R ∈ ℝ^(N_r×d) with r_j = R[j,:], j ∈ 𝔹, and an ICD concept embedding matrix C ∈ ℝ^(N_c×d) with c_i = C[i,:], i ∈ 𝔻. We describe the VSA with the formula in Equation 2, where 𝒢_i is a graph representing the connections from ICD concept i to atomic symbols k by relationships j.

    c_i = ∑_{(j,k)∈𝒢_i} r_j ⊛ a_k    (2)

Additional details on how to efficiently use PyTorch autograd to learn through these HRR operations are provided in Appendix A.1.
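The following sketch (ours) shows a direct, per-concept implementation of Equation 2 in PyTorch, with binding computed as a pointwise product of Fourier spectra; Appendix A.1 describes the vectorized variant the paper actually uses, which is much faster:

```python
import torch

def build_concept_embeddings(A, R, graphs):
    """A: (N_a, d) atomic vectors; R: (N_r, d) relationship vectors;
    graphs[i]: list of (j, k) edges of graph G_i for ICD concept i."""
    d = A.shape[1]
    A_f = torch.fft.rfft(A, dim=1)  # spectra of atomic vectors
    R_f = torch.fft.rfft(R, dim=1)  # spectra of relationship vectors
    rows = []
    for edges in graphs:
        # Binding r_j ⊛ a_k is a spectral product; bundling is the sum
        spec = sum(R_f[j] * A_f[k] for j, k in edges)
        rows.append(torch.fft.irfft(spec, n=d))
    return torch.stack(rows)  # (N_c, d) concept embedding matrix C

# A and R can be nn.Parameter tensors: gradients flow through the FFTs,
# so training the concept vectors trains the underlying atomic vectors.
A = torch.nn.Parameter(0.02 * torch.randn(22725, 768))  # sizes from the paper
R = torch.nn.Parameter(0.02 * torch.randn(42, 768))     # 40 types + description + identity
C = build_concept_embeddings(A, R, [[(0, 1), (1, 2)], [(0, 1)]])  # toy graphs
```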
2.3.3. Embedding Configurations

We call our method of constructing embeddings for ICD codes purely from HRR representations "HRRBase" and the standard method of creating transformer token embeddings from random vectors "unstructured". While the HRRBase configuration enforces the ontology structure, we wondered whether it would be too rigid and have difficulty representing information not present in SNOMED CT. As dataset frequency information for ICD medical codes is not present in the HRR structure, we tried adding an embedding that represented the empirical frequency of each ICD code in the dataset. We also tried adding fully learnable embeddings with no prior structure.

Given the wide range of ICD code frequencies in MIMIC, we log-transformed the empirical ICD code frequencies and then discretized the resulting range. For our HRRFreq configuration, we used the sinusoidal frequency encoding from [1] to encode the discretized log-frequency information. The frequency embeddings were normalized before being summed with the HRR embedding vectors.

We defined two additional configurations in which a standard embedding vector was integrated with the structured HRR concept vector. With "HRRAdd", a learnable embedding was added to the concept embedding: HRRAdd = C + L_add, where L_add ∈ ℝ^(N_c×d). However, this roughly doubled the number of learnable parameters compared to other formulations.

With "HRRCat", a learnable embedding of dimension d/2 was concatenated with an HRR concept embedding of dimension d/2. This keeps the total number of learnable parameters roughly the same as in the unstructured configuration (25,493 d-dimensional vectors) and the HRRBase configuration (22,725 d-dimensional vectors). The final embedding matrix was defined as HRRCat = [C L_cat], where C, L_cat ∈ ℝ^(N_c×d/2).
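A sketch (ours) of how the four configurations could be assembled from an HRR concept matrix C; the sinusoidal encoding reuses the position-encoding formula of [1] on discretized log-frequency bins, and all variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def sinusoidal(bins, d):
    # Sinusoidal encoding as in [1], applied to discretized log-frequency bins
    pos = bins.float().unsqueeze(1)            # (N_c, 1)
    i = torch.arange(0, d, 2).float()          # (d/2,)
    angles = pos / (10000.0 ** (i / d))        # (N_c, d/2)
    enc = torch.zeros(len(bins), d)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return F.normalize(enc, dim=1)             # normalized before summing

N_c, d = 25493, 768
C = torch.randn(N_c, d)              # stands in for the Eq. 2 output
bins = torch.randint(0, 8, (N_c,))   # discretized log-frequencies (illustrative)

E_base = C                                          # HRRBase
E_freq = C + sinusoidal(bins, d)                    # HRRFreq
L_add = torch.nn.Parameter(0.02 * torch.randn(N_c, d))
E_add = C + L_add                                   # HRRAdd: ~2x learnable params
L_cat = torch.nn.Parameter(0.02 * torch.randn(N_c, d // 2))
# HRRCat: in the paper the HRR half is built directly at dimension d/2;
# slicing C here is only for illustration.
E_cat = torch.cat([C[:, : d // 2], L_cat], dim=1)
```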
2.4. Experiments

We pre-trained the unstructured, HRRBase, HRRCat, and HRRAdd embedding configurations of HRRBERT on the masked language modelling (MLM) task, for 3 trials each. For each of the 3 pre-trained models, 10 fine-tuning trials were conducted, for a total of 30 trials per fine-tuning task. The best checkpoint from the 10 epochs of fine-tuning was saved based on validation performance. A test set containing 666 patient records was used to evaluate each of the fine-tuned models for both mortality and disease prediction. We report accuracy, precision, recall, and F1 scores averaged over the 30 trials for the fine-tuning tasks.

3. Experimental Results

3.1. Pre-training

MLM accuracy was evaluated on a validation set over the course of pre-training. Pre-training results for the different configurations are shown in Figure 1. The pre-training results are averaged over 3 runs for each configuration except HRRFreq, for which only 1 model run was completed.

[Figure 1: Pre-training validation-set evaluation results for different configurations.]

The baseline of learned unstructured embeddings has a peak pre-training validation performance of around 33.4%. HRRBase embeddings perform around 17% worse than this baseline. We hypothesize that this decrease in performance is due to a lack of embedded frequency information in HRRBase compared to learned unstructured embeddings. HRRFreq (which combines SNOMED CT information with frequency information) performs similarly to unstructured embeddings, supporting this hypothesis. Compared to baseline, HRRAdd and HRRCat improve pre-training performance by a modest margin of around 2%. We posit that the almost 20% increase in performance of HRRCat and HRRAdd over HRRBase during pre-training is partly due to the fully learnable embeddings in HRRCat and HRRAdd learning frequency information.

3.2. Fine-tuning

We fine-tuned the networks for mortality prediction and disease prediction. Across metrics and tasks, the best results were often seen with HRRBase (Table 1), with some differences being statistically significant.

3.2.1. Mortality Prediction Task

The mortality prediction task is defined as predicting patient mortality within 6 months after the last visit. Binary mortality labels were generated by comparing the time difference between the last visit and the mortality date. A training set of 13k patient records along with a validation set of 2k patient records were used to fine-tune each model on mortality prediction. Table 1 shows the evaluation results of mortality prediction for each of the configurations. We performed a two-sided Dunnett's test to compare our multiple experimental HRR embedding configurations to the control unstructured embeddings, at a p < 0.05 significance level. HRRBase embeddings had a significantly greater mean F1-score (p = 0.043) and precision (p = 0.042) compared to unstructured embeddings.

3.2.2. Disease Prediction Task

The disease prediction task is defined as predicting which disease chapters were recorded in the patient's last visit, using information from earlier visits. We converted all ICD codes in a patient's last visit into a multi-label binary vector of disease chapters. As there are 22 disease chapters defined in ICD-10, the multi-label binary vector has a size of 22, with binary values corresponding to the presence of a disease in each chapter. A training set of 4.5k patient records along with a validation set of 500 patient records were used to fine-tune each model on this task. Table 1 shows the evaluation results of disease prediction for each of the configurations. For the two-sided Dunnett's test, Levene's test shows that the equal-variance condition is satisfied, and the Shapiro-Wilk test suggests normal distributions except for HRRAdd accuracy. The test showed that HRRBase embeddings had a significantly greater mean accuracy (p = 0.033) and precision (p = 0.023) compared to unstructured embeddings. No other mean metrics for HRR embeddings were significantly greater than the control.
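A sketch (ours) of the multi-label target construction for this task; `chapter_of` is a hypothetical lookup from an ICD code to its chapter index 0–21:

```python
import torch

def disease_target(last_visit_codes, chapter_of):
    """Map the codes of a patient's last visit to a 22-dim binary chapter vector."""
    y = torch.zeros(22)
    for code in last_visit_codes:
        y[chapter_of(code)] = 1.0  # multi-label: any number of chapters may be set
    return y

# Toy lookup for illustration; a real one follows the ICD-10 chapter ranges
y = disease_target(["K219-10", "E8889-9"], lambda c: hash(c) % 22)
```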
3.2.3. eICU Mortality Prediction

An additional experiment conducted on the Philips Electronic Intensive Care Unit (eICU) database [31] corroborates the MIMIC-IV results. For this experiment, we applied our mortality prediction models that were fine-tuned on MIMIC-IV to eICU data to see whether our results generalize. Table 1 shows that HRRBase embeddings had a significantly greater mean accuracy (p = 0.046) compared to unstructured embeddings when applied to the eICU dataset. These models are not optimized for mortality prediction at other hospitals, where coding methodology and clinical practice may differ. For example, the most common code in the eICU dataset represents acute respiratory failure, whereas the most common code in the MIMIC-IV dataset represents hypertension.

Table 1: Fine-tuning mean test scores and standard deviations for mortality prediction, disease prediction, eICU mortality prediction, and both Really-Out-Of-Distribution (ROOD) Unseen and Overall disease prediction tasks. The best scores are bolded, and underlined if statistically significant, in the original typeset table.

Task                  Configuration  Accuracy  Precision  Recall    F1-Score
ROOD Unseen           HRRBase        94.9±1.0  83.5±4.6   76.8±5.1  79.5±4.9
                      Unstructured   92.3±0.3  46.2±0.0   50.0±0.1  48.0±0.1
ROOD Overall          HRRBase        81.9±0.1  78.3±0.3   75.2±0.8  76.4±0.5
                      Unstructured   81.9±0.2  78.7±0.7   74.4±1.2  76.0±0.8
Mortality Prediction  HRRBase        84.4±2.3  65.8±2.0   85.6±2.2  69.2±2.7
                      HRRAdd         84.0±2.2  65.7±1.9   85.7±2.3  68.9±2.5
                      HRRCat         83.9±2.3  65.6±1.7   84.9±2.8  68.8±2.5
                      Unstructured   83.4±1.9  64.9±1.2   84.6±2.2  67.9±1.8
Disease Prediction    HRRBase        79.9±0.5  73.0±1.2   67.2±0.7  69.0±0.6
                      HRRAdd         79.6±0.7  72.6±1.4   67.3±0.9  69.0±0.6
                      HRRCat         79.6±0.8  72.5±1.7   67.3±1.0  68.9±0.8
                      Unstructured   79.4±0.5  72.1±1.1   67.8±1.0  69.2±0.7
eICU Mortality        HRRBase        68.9±1.3  75.0±1.8   57.0±5.8  64.5±3.5
Prediction            HRRAdd         68.1±1.6  74.0±2.2   56.2±6.8  63.6±3.9
                      HRRCat         68.2±1.2  73.8±2.6   57.0±7.2  64.0±3.7

3.2.4. Really-Out-Of-Distribution (ROOD) Disease Prediction

We conducted an additional disease-prediction experiment to test generalization to patients with codes outside the training distribution. We found six patients whose records consisted of only 32 codes between them (see the list of codes in Appendix A). We created a really-out-of-distribution (ROOD) dataset that consisted of all patients in MIMIC-IV (nearly 30K) with at least one of these codes, and used it as a validation set. The separate pre-training and fine-tuning datasets did not contain these codes. We also created a smaller validation dataset consisting of the six patients with only these codes. During pre-training, the HRRBase and unstructured models did not encounter any examples using the 32 ROOD codes and so did not explicitly learn representations for those codes. The trained models were then tested on the ROOD dataset.

The ROOD disease-prediction results in Table 1 show that HRRBase outperforms the unstructured embedding model in contexts of entirely unseen codes. We assess statistical significance using a two-tailed, independent t-test with unequal variance, as some measurements failed Levene's test for equal variance. The means of all metrics for HRRBase are significantly greater than for unstructured embeddings when making inferences on patients with entirely unseen codes (p < 0.001 for all metrics). Given the embedded ontological structure, we hypothesize that HRRBase implicitly learns useful embeddings for the 32 unseen ROOD codes by learning shared embedding components of the VSA when training on other codes. Unstructured embeddings cannot learn better representations for codes never seen in training.

3.3. t-SNE of Frequency Bias

We computed t-SNE dimension reductions to visualize relationships among ICD code embeddings in the pre-trained models.

[Figure 2: Comparing t-SNE of (a) unstructured embeddings, (b) HRRAdd, (c) HRRCat, and (d) HRRBase. The t-SNE graphs are color-coded by the frequency of the ICD codes in the dataset: highly frequent codes are colored blue, while infrequent codes are colored red.]
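Such a visualization can be produced along these lines (our sketch; `C` is a pre-trained embedding matrix as a NumPy array and `counts` the per-code corpus frequencies, both assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the (N_c, d) embedding matrix to 2-D
xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(C)

# Reversed coolwarm: high log-frequency maps to blue, low to red
plt.scatter(xy[:, 0], xy[:, 1], c=np.log(counts), cmap="coolwarm_r", s=3)
plt.colorbar(label="log code frequency")
plt.show()
```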
Figure 2 shows that unstructured embeddings of common ICD codes are clustered together, with a large separation from those of uncommon codes. This suggests that code-frequency information is prominently represented in these embeddings, consistent with frequency bias in related models [6]. Common and uncommon code clusters are less distinct in HRRBase, which does not explicitly encode frequency information.

As shown in Figure 1, adding code-frequency information to the structured HRRBase embeddings (i.e., the HRRFreq embeddings) improved the pre-training loss to be similar to that of unstructured embeddings. This suggests that the unstructured components in HRRAdd and HRRCat may have learned some frequency information, since their losses are also similar to the loss of models with unstructured embeddings. To investigate whether this occurred, we performed t-SNE dimension reductions of the unstructured components of HRRAdd and HRRCat and colored the points by code frequency, as shown in Figure 3. This graph suggests that these additional unstructured embeddings do learn some frequency information, given the clustering of high-frequency codes. However, the frequency information learned by the HRRCat and HRRAdd learnable embeddings influences the overall embeddings less strongly than in unstructured embeddings, as seen in Figure 2, where low-frequency embeddings are less distinctly separated from higher-frequency embeddings.

[Figure 3: t-SNE representation of sinusoidal frequency embeddings (left), and unstructured embedding components of HRRAdd (middle) and HRRCat (right).]

3.4. Top-k Accuracy for MLM

Accurately predicting infrequently used disease codes is an important, clinically relevant task. Given that the model sees more common codes than rare codes during training, rare codes are naturally challenging to predict. Based on the promising empirical results on out-of-distribution mortality prediction for eICU and disease prediction on ROOD, we hypothesized that our HRR embedding models should have improved accuracy when predicting rare codes compared to unstructured embedding models, since rare codes should share some atomic vectors in their representations with common codes.

To test this, we evaluated the accuracy of an MLM pre-trained model predicting a single masked code of a known frequency. We split the codes in the pre-training validation dataset into 7 bins from log frequency -14 to 0, such that each bin has a width of 2. The most common codes are in the bin with log frequencies between -2 and 0, while the rarest codes are in the bin with log frequencies between -14 and -12. From each bin, we selected 400 codes at random, repeating codes from that bin if there were fewer than 400. For each of these codes, we selected one patient who had that code in their history, masked that code as would be done in MLM, and created a dataset of these 2,800 patients to use for MLM inference.
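The probe set can be constructed roughly as follows (our sketch; `counts` maps each code to its occurrence count in the pre-training corpus, and the log base is an assumption, chosen so that the rarest codes fall near -14):

```python
import numpy as np

rng = np.random.default_rng(0)
codes = np.array(list(counts))
freq = np.array([counts[c] for c in codes], dtype=float)
logf = np.log(freq / freq.sum())   # log relative frequency of each code

probe = []
edges = np.arange(-14, 1, 2)       # 7 bins of width 2 over [-14, 0]
for lo, hi in zip(edges[:-1], edges[1:]):
    members = codes[(logf >= lo) & (logf < hi)]
    # 400 codes per bin, repeating codes when the bin holds fewer than 400
    picks = rng.choice(members, size=400, replace=len(members) < 400)
    probe.extend(picks)            # 7 x 400 = 2,800 masked-code probes
```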
Figure 4 and Figure 5, respectively, show the MLM top-10 and top-100 accuracy for predicting codes in the different frequency bins, averaged across the three pre-training models per configuration. We assess statistical significance for each bin using a two-tailed Dunnett's test comparing the mean accuracy scores of the experimental HRR configurations against the control unstructured configuration; significant comparisons at the p < 0.05 level are indicated with asterisks.

[Figure 4: The top-10 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. Significance levels of 0.05, 0.01, and 0.001 relative to unstructured embeddings are indicated with 1, 2, and 3 asterisks, respectively. Note that HRRBase is expected to perform poorly in this test due to its lack of code-frequency information.]

[Figure 5: The top-100 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. Significance levels of 0.05, 0.01, and 0.001 relative to unstructured embeddings are indicated with 1, 2, and 3 asterisks, respectively.]

Notably, the top-100 accuracy in frequency bin -12 is non-zero for the HRR methods. The codes in this rarest bin occur only once in the dataset and therefore have never been used by the model for gradient updates, since they are in the validation dataset. This suggests that the HRR methods have some ability to provide clinically relevant information about rare codes. However, accuracy for the rarest codes remains too low to be of practical value, perhaps due to limited overlap of these codes' atomic vectors with those of more common codes.
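Top-k accuracy at a masked position can be computed as in the following sketch (ours, assuming a HuggingFace-style masked-LM interface whose output exposes `.logits`):

```python
import torch

@torch.no_grad()
def top_k_accuracy(model, input_ids, masked_pos, true_ids, k=10):
    # logits: (batch, seq_len, vocab); pick the masked position of each row
    logits = model(input_ids=input_ids).logits
    rows = torch.arange(len(true_ids))
    topk = logits[rows, masked_pos].topk(k, dim=-1).indices   # (batch, k)
    hits = (topk == true_ids.unsqueeze(1)).any(dim=1)         # true code in top-k?
    return hits.float().mean().item()
```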
3.5. Medical Code Case Study

Table 2 shows case studies for the codes Other and unspecified hyperlipidemia (2724-9), Hypothermia (9916-9), and Gastro-esophageal reflux disease without esophagitis (K219-10). In the first case study, for 2724-9, we observe that highly ontologically similar codes, such as Other hyperlipidemia and Hyperlipidemia, unspecified, are encoded with high cosine similarity by HRRBase, which is not the case for unstructured embeddings. The co-occurrence problem can be seen in the second case study, for 9916-9: the most similar codes for HRRBase are medically similar codes that would not usually co-occur, while for unstructured embeddings the most similar codes co-occur frequently. In the final case study, for K219-10, frequency-related bias can be observed in the unstructured embeddings, where frequent but mostly ontologically unrelated codes appear in the top list of cosine-similar codes, whereas the top list for HRRBase contains medically similar codes.

Table 2: Three cosine-similarity case studies looking at related ICD codes for unstructured and HRRBase embeddings. The top 4 cosine-similar ICD codes to the chosen code are listed (most to least similar) with their full description and similarity value.

2724-9 - Other and unspecified hyperlipidemia
  Unstructured: Pure hypercholesterolemia (0.542); Hyperlipidemia, unspecified (0.482); Esophageal reflux (0.304); Anemia, unspecified (0.279)
  HRRBase: Other hyperlipidemia (1.000); Hyperlipidemia, unspecified (1.000); Pure hypercholesterolemia (0.463); Mixed hyperlipidemia (0.418)

9916-9 - Hypothermia
  Unstructured: Frostbite of hand (0.418); Frostbite of foot (0.361); Drowning and nonfatal submersion (0.352); Immersion foot (0.341)
  HRRBase: Hypothermia, initial encounter (0.794); Hypothermia not with low env. temp. (0.592); Effect of reduced temp., initial encounter (0.590); Other specified effects of reduced temp. (0.590)

K219-10 - Gastro-esophageal reflux disease without esophagitis
  Unstructured: Esophageal reflux (0.565); Hyperlipidemia, unspecified (0.335); Anxiety disorder, unspecified (0.332); Essential (primary) hypertension (0.326)
  HRRBase: Esophageal reflux (0.635); Gastro-eso. reflux d. with esophagitis (0.512); Reflux esophagitis (0.512); Hypothyroidism, unspecified (0.268)

We broadened this case study to test statistical differences in cosine and semantic embedding similarity between structured and unstructured embeddings. 30 ICD codes were selected from different frequency categories in the dataset: 10 codes drawn randomly from the 300 most common codes, 10 codes drawn randomly by weighted frequency from codes appearing fewer than 30 times in the dataset, and 10 codes drawn randomly by weighted frequency from the entire dataset. For each selected code, the top 4 cosine-similar ICD codes were assessed by a physician for ontological similarity.

For each frequency category, a one-tailed Fisher's exact test was conducted to determine whether a relationship existed between embedding type and clinical relatedness. We found that the results for the rare codes were statistically significant, with p = 2.44×10⁻⁸. With 10 rare codes and the top 4 cosine-similar ICD codes selected for each, there are 40 top cosine-similar codes in total. For unstructured embeddings, only 4 of the top 40 cosine-similar codes were deemed strongly ontologically related by our physician, with the remaining codes deemed less related or unrelated. For our structured HRRBase embeddings, 28 of the top 40 cosine-similar codes were deemed strongly ontologically related, with the remaining codes deemed less related or unrelated. This suggests that knowledge-integrated structured embeddings are associated with greater clinical relevance of the top cosine-similar codes than unstructured embeddings for rare codes, where little training data exists.
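The nearest-neighbour lists in Table 2 correspond to the following computation (our sketch; `C` is the embedding matrix of a pre-trained model):

```python
import torch
import torch.nn.functional as F

def top_similar(C, i, k=4):
    # Cosine similarity of code i's embedding to every code's embedding
    sims = F.cosine_similarity(C[i].unsqueeze(0), C, dim=1)
    sims[i] = -float("inf")      # exclude the query code itself
    values, indices = sims.topk(k)
    return values, indices       # top-k similarities and the matching code ids
```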
4. Discussion

Transformers have leading performance in many applications, but their internal processes are opaque, emerging from enormous parameter sets and data volumes beyond human experience. It is hard to know when they can be trusted; for example, generative transformers are prone to subtle confabulations. Transformers have a general-purpose architecture that performs as well in vision and other modalities as in language. They are a culmination of a key trend in artificial intelligence, away from problem-specific engineering and toward massive data and computation. This trend is justified in terms of performance. However, given two models with equal performance, the one with more explicit conceptual structure is preferable in terms of trust and explainability.

The work presented here is a step in this direction, with our HRRBase embeddings that have explicit conceptual structure and perform equivalently to or better than typical transformer embeddings. The benefit of structured embeddings becomes more pronounced for tasks that involve codes that are rare or not present in training data. HRR embeddings can also be relied on to represent medical meaning rather than co-occurrence in the training data. They also untangle the representation of code frequency, so that it can be included or not, and its effects on decisions understood. Importantly, despite this additional structure, the embeddings are thoroughly learned, suggesting that the approach will be consistent with high performance beyond the examples we have studied.

As our method scales with and leverages PyTorch autograd in the construction of the vector-symbolic embeddings, it is compatible with existing medical LLM architectures as an embedding component capable of encoding domain knowledge.

Future work could explore the potential of these structured embeddings for explaining and controlling the observed frequency bias. As HRRs can be queried with linear operations, future work could also explore whether transformers can learn to extract specific information from these composite embeddings. Limitations to address in future work include the complexity of processing knowledge graphs to be compatible with HRRs. Another important limitation is that our method relies on rare-code HRRs sharing atomic elements with common-code HRRs; however, in SNOMED CT, rare codes are likely to contain some rare atomic elements. To address this point, in addition to SNOMED CT, knowledge could be encoded from sources such as pre-trained medical embeddings, different medical ontologies, and other medical domain knowledge to further improve our proposed methodology. In LLMs that process both medical codes and text, it would make sense to share word embeddings between modalities. This would allow training of each modality to benefit from training of the other, and may help to align the representations of codes and text.

5. Conclusion

We proposed a novel hybrid neural-symbolic approach called HRRBERT that integrates medical ontologies represented by HRR embeddings. In tests with the MIMIC-IV dataset, HRRBERT models modestly outperformed baseline models with unstructured embeddings on pre-training, disease prediction accuracy, mortality prediction F1, and fine-tuning tasks involving infrequently seen codes. HRRBERT models had pronounced performance advantages in MLM with rare codes and in disease prediction for patients with no codes seen during training (ROOD Unseen in Table 1). We also showed that HRRs can be used to create medical code embeddings that better respect ontological similarities for rare codes. A key benefit of our approach is that it facilitates explainability by disentangling token-frequency information, which is prominently represented but implicit in unstructured embeddings.
References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[2] L. Rasmy, Y. Xiang, Z. Xie, C. Tao, Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, npj Digital Medicine 4 (2021) 86. doi:10.1038/s41746-021-00455-y.
[3] A. Ganesan, H. Gao, S. Gandhi, E. Raff, T. Oates, J. Holt, M. McLean, Learning with holographic reduced representations, CoRR abs/2109.02157 (2021). URL: https://arxiv.org/abs/2109.02157.
[4] M. K. Sarker, L. Zhou, A. Eberhart, P. Hitzler, Neuro-symbolic artificial intelligence, AI Communications 34 (2021) 197–209.
[5] S. Ramgopal, L. N. Sanchez-Pinto, C. M. Horvat, M. S. Carroll, Y. Luo, T. A. Florin, Artificial intelligence-based clinical decision support in pediatrics, Pediatric Research 93 (2023) 334–341.
[6] T. Yu, T. Tuinstra, B. Hu, R. Rezai, T. Fortin, R. DiMaio, B. Vartian, B. Tripp, Frequency bias in MLM-trained BERT embeddings for medical codes, CMBES Proceedings 45 (2023). URL: https://proceedings.cmbes.ca/index.php/proceedings/article/view/1050.
[7] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al., Pythia: A suite for analyzing large language models across training and scaling, in: International Conference on Machine Learning, PMLR, 2023, pp. 2397–2430.
[8] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.
[9] T. Plate, Holographic reduced representations, IEEE Transactions on Neural Networks 6 (1995) 623–641. doi:10.1109/72.377968.
[10] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, R. Mark, MIMIC-IV (version 2.0), 2022. URL: https://doi.org/10.13026/7vcr-e114. doi:10.13026/7vcr-e114.
[11] P. Kanerva, Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors, Cognitive Computation 1 (2009) 139–159.
[12] R. W. Gayler, Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience, 2004. arXiv:cs/0412059.
[13] P. Neubert, S. Schubert, Hyperdimensional computing as a framework for systematic aggregation of image descriptors, 2021. arXiv:2101.07720. doi:10.48550/arXiv.2101.07720.
[14] P. Neubert, S. Schubert, K. Schlegel, P. Protzel, Vector semantic representations as descriptors for visual place recognition, in: Robotics: Science and Systems XVII, Robotics: Science and Systems Foundation, 2021. URL: http://www.roboticsproceedings.org/rss17/p083.pdf. doi:10.15607/RSS.2021.XVII.083.
[15] A. Rahimi, P. Kanerva, L. Benini, J. M. Rabaey, Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals, Proceedings of the IEEE 107 (2019) 123–143. doi:10.1109/JPROC.2018.2871163.
[16] K. Schlegel, P. Neubert, P. Protzel, HDC-MiniROCKET: Explicit time encoding in time series classification with hyperdimensional computing, 2022. arXiv:2202.08055. doi:10.48550/arXiv.2202.08055.
[17] P. Smolensky, Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159–216. doi:10.1016/0004-3702(90)90007-M.
[18] M. M. Alam, E. Raff, S. Biderman, T. Oates, J. Holt, Recasting self-attention with holographic reduced representations, 2023. arXiv:2305.19534.
[19] J. Kim, H. Lee, M. Imani, Y. Kim, Efficient hyperdimensional learning with trainable, quantizable, and holistic data representation, in: 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2023, pp. 1–6. doi:10.23919/DATE56975.2023.10137134.
[20] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs, 2015. arXiv:1510.04935. doi:10.48550/arXiv.1510.04935.
[21] T. Dash, A. Srinivasan, L. Vig, Incorporating symbolic domain knowledge into graph neural networks, Machine Learning 110 (2021) 1609–1636. doi:10.1007/s10994-021-05966-z.
[22] M. Kulmanov, F. Z. Smaili, X. Gao, R. Hoehndorf, Semantic similarity and machine learning with ontologies, Briefings in Bioinformatics 22 (2020) bbaa199. doi:10.1093/bib/bbaa199.
[23] V. Riikka, V. Anne, P. Sari, Systematized Nomenclature of Medicine-Clinical Terminology (SNOMED CT) clinical use cases in the context of electronic health record systems: Systematic literature review, JMIR Medical Informatics (2023). doi:10.2196/43750.
[24] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. M. Fung, J. Poon, Medical concept embedding with multiple ontological representations, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2019, pp. 4613–4619. doi:10.24963/ijcai.2019/641.
[25] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, H. E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals, Circulation 101 (2000) e215–e220. doi:10.1161/01.CIR.101.23.e215.
[26] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). arXiv:1810.04805.
[27] NLM, SNOMED CT to ICD-10-CM map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
[28] OHDSI, OHDSI standardized vocabularies, 2019. URL: https://github.com/OHDSI/Vocabulary-v5.0/wiki.
[29] NLM, ICD-9-CM diagnostic codes to SNOMED CT map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
[30] NCHS, Diagnosis code set general equivalence mappings, 2018. URL: https://ftp.cdc.gov/pub/health_statistics/nchs/Publications/ICD10CM/2018/Dxgem_guide_2018.pdf.
[31] T. J. Pollard, A. E. W. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, O. Badawi, The eICU Collaborative Research Database, a freely available multi-center database for critical care research, Scientific Data 5 (2018) 1–13.
A. List of 32 ROOD Codes

The following is the list of 32 ROOD codes:

1. G248-10: Other dystonia
2. E8498-9: Accidents occurring in other specified places
3. E9688-9: Assault by other specified means
4. Z681-10: Body mass index (BMI) 19.9 or less, adult
5. 30550-9: Opioid abuse, unspecified
6. R262-10: Difficulty in walking, not elsewhere classified
7. E887-9: Fracture, cause unspecified
8. R471-10: Dysarthria and anarthria
9. 9916-9: Hypothermia
10. E9010-9: Accident due to excessive cold due to weather conditions
11. F10129-10: Alcohol abuse with intoxication, unspecified
12. E8499-9: Accidents occurring in unspecified place
13. R636-10: Underweight
14. 920-9: Contusion of face, scalp, and neck except eye(s)
15. R4182-10: Altered mental status, unspecified
16. 95901-9: Head injury, unspecified
17. 78097-9: Altered mental status
18. F29-10: Unspecified psychosis not due to a substance or known physiological condition
19. Z880-10: Allergy status to penicillin
20. Z818-10: Family history of other mental and behavioral disorders
21. 81600-9: Closed fracture of phalanx or phalanges of hand, unspecified
22. 87341-9: Open wound of cheek, without mention of complication
23. H9222-10: Otorrhagia, left ear
24. Z978-10: Presence of other specified devices
25. G20-10: Parkinson's disease
26. G249-10: Dystonia, unspecified
27. 9100-9: Abrasion or friction burn of face, neck, and scalp except eye, without mention of infection
28. 78906-9: Abdominal pain, epigastric
29. E8889-9: Unspecified fall
30. 30500-9: Alcohol abuse, unspecified
31. G520-10: Disorders of olfactory nerve
32. 8020-9: Closed fracture of nasal bones
A.1. Learning through HRR Operations Efficiently

To make the HRR concept embeddings useful for a deep neural network, the operations used to form the embeddings need to be compatible with backpropagation, so that gradient descent can update the lower-level atomic vectors. We desired a function that produced the ICD concept embedding matrix C given the inputs of the VSA knowledge graphs 𝒢_i and the symbol embedding matrices R and A.

We attempted three approaches to computing C through VSA operations. First, we naively tried to compute each concept vector in C one at a time. However, this approach was too slow in both the forward and backward pass, requiring more than 1 second for each pass. Our second approach used slices of 𝒢 along the relationship dimension as a sparse binary matrix which, when multiplied with A, would perform the indexing and summing of atomic vectors for each concept. The result can be convolved with the relationship vector and added to the concept embedding matrix. This approach was much faster and used a moderate amount of memory for one of our less complex VSA formulations. However, when dealing with our most complex formulation, it used ~15 GB of memory.

Our final approach took advantage of the fact that many disease concepts use the same relationship, but to different atomic symbols. Also, the number of times a concept uses a particular relationship is relatively low, except for the SNOMED "isA" relationship and our defined "description" relationship. Thus, for a particular relationship, we can contribute to building many disease concept vectors at once by selecting many atomic vectors, doing a vectorized convolution with the relationship vector, and distributing the results to be added to the appropriate concept embedding rows. This step needs to be repeated at most m times for a particular relationship, where m is the maximum multiplicity of that relationship among all concepts. We improved memory efficiency by performing fast Fourier transforms (FFTs) on the atomic vector embeddings and constructing the concept vectors by performing binding via element-wise multiplication in the Fourier domain. Due to the linearity of the HRR operations, we performed a single final inverse FFT on the complex-valued concept embeddings to convert back to the real domain.

The final approach is much faster than the first approach, since it takes advantage of vectorized operations to contribute to many concept vectors at once. It is also more memory efficient than the second approach, since all the intermediate results are dense, so allocations are not wasted on creating mostly sparse results. On our most complex formulation, this approach uses ~3.5 GB of memory and takes ~80 ms and ~550 ms for the forward and backward pass, respectively.
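A condensed sketch of this final approach (ours; it follows the description above, assuming the edges of each 𝒢_i have been regrouped by relationship into index arrays, one group per pass):

```python
import torch

def build_concepts_fft(A, R, passes, N_c):
    """passes: list of (j, concept_idx, atom_idx) triples, where concept_idx and
    atom_idx are 1-D index tensors pairing concepts with atomic symbols for
    relationship j. Each relationship appears at most m times (its maximum
    multiplicity among all concepts)."""
    d = A.shape[1]
    A_f = torch.fft.rfft(A, dim=1)
    R_f = torch.fft.rfft(R, dim=1)
    C_f = torch.zeros(N_c, d // 2 + 1, dtype=A_f.dtype)
    for j, concept_idx, atom_idx in passes:
        # Vectorized binding: one spectral product covers many concepts at once
        bound = R_f[j].unsqueeze(0) * A_f[atom_idx]
        # Bundle (accumulate) into the matching concept rows; intermediate
        # results stay dense, which is what keeps memory use low
        C_f = C_f.index_add(0, concept_idx, bound)
    # Single inverse FFT at the end, valid by linearity of the HRR operations
    return torch.fft.irfft(C_f, n=d, dim=1)
```

Because `A` and `R` can be nn.Parameter tensors and every step (rfft, indexing, multiplication, index_add, irfft) is autograd-compatible, gradients from the transformer's loss reach the atomic vectors directly, which is the property the method relies on.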