=Paper=
{{Paper
|id=Vol-3894/paper13
|storemode=property
|title=Encoding Medical Ontologies With Holographic Reduced Representations for Transformers
|pdfUrl=https://ceur-ws.org/Vol-3894/paper13.pdf
|volume=Vol-3894
|authors=Bing Hu,Trevor Yu,Tia Tuinstra,Ryan Rezai,Harshit Bokadia,Rachel DiMaio,Thomas Fortin,Brian Vartian,Bryan Tripp
|dblpUrl=https://dblp.org/rec/conf/kil/HuYTRBDFVT24
}}
==Encoding Medical Ontologies With Holographic Reduced Representations for Transformers==
Bing Hu (1,∗), Trevor Yu (1), Tia Tuinstra (1), Ryan Rezai (1), Harshit Bokadia (1), Rachel DiMaio (1), Thomas Fortin (1), Brian Vartian (1,2), and Bryan Tripp (1)

(1) University of Waterloo, Ontario, Canada
(2) McMaster University, Ontario, Canada
Abstract
Transformer models trained on NLP tasks with medical codes often have randomly initialized embeddings that are then
adjusted based on training data. For terms appearing infrequently in the dataset, there is little opportunity to improve
these representations and learn semantic similarity with other concepts. Medical ontologies represent many biomedical
concepts and define a relationship structure between these concepts, making ontologies a valuable source of domain-specific
information. Holographic Reduced Representations (HRR) are capable of encoding ontological structure by composing atomic
vectors to create structured higher-level concept vectors. We developed an embedding layer that generates concept vectors for
clinical diagnostic codes by applying HRR operations that compose atomic vectors based on the SNOMED CT ontology. This
approach allows for learning the atomic vectors while maintaining structure in the concept vectors. We trained a Bidirectional
Encoder Representations from Transformers (BERT) model to process sequences of clinical diagnostic codes and used the
resulting HRR concept vectors as the embedding matrix for the model. The HRR-based approach introduced interpretable
structure into code embeddings while maintaining or modestly improving performance on the masked language modeling
(MLM) pre-training task (particularly for rare codes) as well as the fine-tuning tasks of mortality and disease prediction. This
approach also better maintains semantic similarity between medically related concept vectors, due to both shared atomic
vectors and disentangling of code-frequency information.
Keywords
Deep Learning, Ontology, Knowledge-Integration
1. Introduction

Transformers [1] jointly optimize high-dimensional vector embeddings that represent input tokens, and a network that contextualizes and transforms these embeddings to perform a task. Originally designed for natural language processing (NLP) tasks, transformers are now widely used with other data modalities. In medical applications, one important modality consists of medical codes that are extensively used in electronic health records (EHR). A prominent example in this space is Med-BERT [2], which consumes a sequence of diagnosis codes. Tasks that Med-BERT and other EHR-transformers perform include disease and mortality prediction.

Deep networks have traditionally been alternatives to symbolic artificial intelligence, with different advantages [3]. Deep networks use real-world data effectively, but symbolic approaches have complementary properties, such as better transparency and capacity for incorporating structured information, inspiring many efforts to combine the two approaches in neuro-symbolic systems [4]. Additional transparency and the ability to incorporate structured information are potential benefits of symbolic approaches in medical applications [5]. Standard large language models (LLMs) can be prone to biases in the training data, such as frequency bias, which can result in medical misinformation and potentially clinical harm [6, 7, 8].

Here we use a novel neuro-symbolic medical transformer architecture incorporating structured knowledge from an authoritative medical ontology into the embeddings. Specifically, we use vector-symbolic holographic reduced representations (HRRs) [9] to produce composite medical-code embeddings, and backpropagate through the architecture to optimize the embeddings of atomic concepts. This approach produces optimized medical-code embeddings with an explicit structure that incorporates medical knowledge.

We test our method, Holographic Reduced Representation Bidirectional Encoder Representations from Transformers (HRRBERT), on the Medical Information Mart for Intensive Care (MIMIC)-IV dataset [10] and show improvements in both pre-training and fine-tuning tasks. We also show that our embeddings of ontologically similar rare medical codes have high cosine similarity, in contrast with embeddings that are learned in the standard way. Finally, we investigate learned representations of medical-code frequency, in light of recent demonstrations of frequency bias in EHR-transformers [6].

KiL’24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
∗ Corresponding author: bingxu.hu@uwaterloo.ca (B. Hu)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

We contribute:

• A novel neuro-symbolic architecture, HRRBERT,
that combines vector-symbolic embeddings with the BERT LLM architecture, leading to better performance in medical tasks.
• Efficient construction of vector-symbolic embeddings that leverage PyTorch autograd on GPUs.
• Optimized medical-code embeddings that better respect the semantic similarity of medical terminology than standard embeddings for infrequently used codes.

We focus here on processing medical codes, but our methods would extend naturally to foundation models that combine medical codes and natural language. Specifically, the trained atomic vectors of our vector-symbolic embeddings could share a dictionary with language embeddings, so that training of each could improve the representation of the other.

1.1. Background and Related Works

The Vector-Symbolic Architectures (VSA) approach is a computing paradigm that relies on high dimensionality and randomness to represent concepts as unique vectors in a high-dimensional space [11]. VSAs create and manipulate distributed representations of concepts by combining base vectors with bundling, binding, and permutation algebraic operators [12]. For example, a scene with a red box and a green ball could be described with the vector SCENE = RED ⊗ BOX + GREEN ⊗ BALL, where ⊗ indicates binding and + indicates bundling. The atomic concepts RED, GREEN, BOX, and BALL are represented by base vectors, which are typically random. VSAs also define an inverse operation that allows the decomposition of a composite representation. For example, the scene representation could be queried as SCENE ⊗ BOX⁻¹. This should return the representation of RED, or an approximation of RED that is identifiable when compared to a dictionary. In a VSA, the similarity between concepts can be assessed by measuring the distance between the two corresponding vectors.

VSAs were proposed to address challenges in modelling cognition, particularly language [12]. However, VSAs have also been successfully applied across a variety of domains and modalities outside of language, including vision [13, 14], biosignal processing [15], and time-series classification [16]. Regardless of the modality or application, VSAs provide value by enriching vectors with additional information, such as spatial semantic information in images and global time encoding in time series.

An early VSA framework was Smolensky’s Tensor Product Representation [17], which addressed the need for compositionality but suffered from exploding model dimensionality. The VSA framework introduced by Plate, Holographic Reduced Representations (HRR), improved upon Smolensky’s by using circular convolution as the binding operator [9]. Circular convolution keeps the output in the same dimension, solving the problem of exploding dimensionality.

In the field of deep learning, HRRs have been used in previous work to recast self-attention for transformer models [18], to improve the efficiency of neural networks performing a multi-label classification task by using an HRR-based output layer [3], and as a learning model itself with a dynamic encoder that is updated through training [19]. In all of these works, the efficiency and simple arithmetic of HRRs are leveraged. Our work differs in that we also leverage the ability of HRRs to create structured vectors to represent complex concepts as inputs to a transformer model.

VSAs such as HRRs can effectively encode domain knowledge, including complex concepts and the relationships between them. For instance, Nickel et al. [20] propose holographic embeddings that make use of VSA properties to learn and represent knowledge graphs. Encoding domain knowledge is of interest in the field of deep learning, as it could improve, for example, a deep neural network’s ability to leverage human knowledge and to communicate its results within a framework that humans understand [21]. Ontologies are a form of domain knowledge incorporated into machine learning models to use background knowledge to create embeddings with meaningful similarity metrics, among other purposes [22]. In our work, we use HRRs to encode domain knowledge in trainable embeddings for a transformer model. The domain knowledge we use comes from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), a widely used clinical ontology system that includes definitions of relationships between clinical concepts [23].

To the best of our knowledge, HRRs have not been used before as embeddings for transformer models. Transformer models typically use learned embeddings with random initializations [1]. However, in the context of representing ontological concepts, using such unstructured embeddings can have undesirable effects. One problem is the inconsistency between the rate or patterns of co-occurrence of medical concepts and their degree of semantic similarity as described by the ontology. For example, the concepts of “Type I Diabetes” and “Type II Diabetes” are mutually exclusive in EHR data and do not follow the same patterns of occurrence, due to differences in pathology and patient populations [24]. The differences in occurrence make it difficult for a transformer model to learn embeddings with accurate similarity metrics. The concepts should have relatively high similarity according to the ontology: they share the common ancestor “Diabetes Mellitus,” they are both metabolic disorders that affect blood glucose levels, and they can both lead to similar health outcomes.
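The binding, bundling, and unbinding operations above can be sketched in a few lines of NumPy. This is an illustrative toy (the dimension, seed, and symbol names are arbitrary choices), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # high dimensionality keeps random base vectors nearly orthogonal

def bind(a, b):
    # HRR binding: circular convolution, computed via FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def inverse(a):
    # Plate's approximate inverse: index reversal (an involution)
    return np.concatenate(([a[0]], a[1:][::-1]))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# random base vectors for the atomic concepts
atoms = {n: rng.normal(0, 1 / np.sqrt(d), d)
         for n in ["RED", "GREEN", "BOX", "BALL"]}

# SCENE = RED (x) BOX + GREEN (x) BALL  (bind, then bundle by addition)
scene = bind(atoms["RED"], atoms["BOX"]) + bind(atoms["GREEN"], atoms["BALL"])

# query the colour bound to BALL: SCENE (x) BALL^-1 ~= GREEN + noise
query = bind(scene, inverse(atoms["BALL"]))
best = max(atoms, key=lambda n: cosine(query, atoms[n]))
print(best)  # GREEN: the noisy result is cleaned up by dictionary lookup
```

Unbinding is approximate: `query` is GREEN plus crosstalk from the other bound pair, which is why the result is compared against a dictionary of known atoms rather than used directly.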
Song et al. [24] seek to address this type of inconsistency by training multiple “multi-sense” embeddings for each non-leaf node in an ontology’s knowledge graph via an attention mechanism. However, the “multi-sense” embeddings do not address the learned frequency-related bias that also arises from the co-occurrence of concepts. Frequency-related bias raises an explainability issue, as it leads to learned embeddings that do not reflect true similarity relationships between concepts (for example, as defined in an ontology) but instead reflect the frequency of the concepts in the dataset [6]. This bias particularly affects codes that are used less frequently.

Our proposed approach, HRRBERT, uses the structure from SNOMED CT to represent thousands of concepts with high-dimensional vectors such that each vector reflects a particular clinical meaning and can be compared to other vectors using the HRR similarity metric, cosine similarity. It also leverages the computing properties of HRRs to provide structured embeddings for an LLM that support optimization through backpropagation.

2. Methods

2.1. MIMIC-IV Dataset

The data used in this study was derived from the Medical Information Mart for Intensive Care (MIMIC) v2.0 database, which is composed of de-identified EHRs from in-patient hospital visits between 2008 and 2019 [10]. MIMIC-IV is available through PhysioNet [25]. We used the ICD-9 and ICD-10 diagnostic codes from the icd_diagnosis table from the MIMIC-IV hosp module. We filtered out patients who did not have at least one diagnostic code associated with their records. Sequences of codes were generated per patient by sorting their hospital visits by time. Within one visit, the order of codes from the MIMIC-IV database was used, since it represents the relative importance of the code for that visit. Each unique code was assigned a token. In total, there were 189,980 patient records in the dataset. We used 174,890 patient records for pre-training, on which we performed a 90–10 training-validation split. We reserved 15k records for fine-tuning tasks.

2.2. Model Architecture

We utilized a BERT-base model architecture with a post-layer-norm position and a sequence length of 128 ICD codes [26]. A custom embedding class was used to support the functionality required for our HRR embeddings. We adapted the BERT segment embeddings to represent groups of codes from the same hospital visit, using up to 100 segment embeddings to encode visit sequencing. An embedding dimension of 𝑑 = 768 was used, and all embeddings were initialized from 𝑥 ∼ 𝒩_𝑑(0, 0.02), as in [26], including the atomic vectors for HRR embeddings. Fine-tuning used a constant learning rate schedule with a weight decay of 4e-6. Fine-tuning lasted 10 epochs with a batch size of 80.

2.3. Encoding SNOMED Ontology with HRR Embeddings

In this section, we detail the methodology of constructing vector embeddings for ICD disease codes using HRR operations based on the SNOMED CT structured clinical vocabulary. We first describe our mapping from ICD concepts to SNOMED CT terms. Next, we define how the atomic symbols present in the SNOMED CT ontology are combined using HRR operations to construct concept vectors for the ICD codes. Finally, we describe our method to efficiently compute the HRR embedding matrix using default PyTorch operations that are compatible with autograd.

2.3.1. Mapping ICD to SNOMED CT Ontology

Our data uses ICD-9 and ICD-10 disease codes while our symbolic ontology is defined in SNOMED CT, so we required a mapping from the ICD to the SNOMED CT system to build our symbolic architecture. We used the SNOMED CT International Release from May 31, 2022 [23] and only included SNOMED CT terms that were active at the time of that release. While SNOMED publishes a mapping tool from SNOMED CT to ICD-10, a majority of ICD-10 concepts have one-to-many mappings in the ICD-to-SNOMED CT direction [27]. To increase the fraction of one-to-one mappings, we used additional published mappings from the Observational Medical Outcomes Partnership (OMOP) [28], mappings from ICD-9 directly to SNOMED CT [29], and mappings from ICD-10 to ICD-9 [30].

Notably, after excluding ICD codes with no active SNOMED CT mapping, 671 out of the 26,164 unique ICD codes in the MIMIC-IV dataset were missing mappings. When those individual codes were removed, a data volume of 4.62% of codes was lost. This removed 58 out of 190,180 patients from the dataset, as they had no valid ICD codes in their history. Overall, the remaining 25,493 ICD codes mapped to a total of 12,263 SNOMED CT terms.

2.3.2. SNOMED CT Vector Symbolic Architecture

Next, we define how the contents of the SNOMED CT ontology were used to construct a symbolic graph to represent ICD concepts. For a given SNOMED CT term, we used its descriptive words and its relationships to other SNOMED CT terms. A relationship is defined by a relationship type and a target term. In total, there were 13,852 SNOMED CT target terms and 40 SNOMED CT relationship types used to represent all desired ICD concepts. In the ontology, many ICD concepts share SNOMED CT terms in their representations.

The set of relationships was not necessarily unique for each SNOMED CT term. To add more unique information, we used a term’s “fully specified name” and any “synonyms” as an additional set of words describing that term. We set all text to lowercase, stripped punctuation, and split on spaces to create a vocabulary of words. We removed common English stopwords from a custom stopword list that was compiled with assistance from a medical physician. The procedure resulted in a total of 8,833 vocabulary words.

Overall, there were a total of 22,725 “atomic” symbols for the VSA, which included the SNOMED CT terms, relationships, and the description vocabulary. Each symbol was assigned an “atomic vector”. We built a “concept vector” for each of the target 25,493 ICD codes using HRR operations to combine atomic vectors according to the SNOMED CT ontology structure.

To build a 𝑑-dimensional concept vector for a given ICD concept, we first considered the set of all relationships that the concept maps to. We used the HRR operator for binding, circular convolution, to combine vectors representing the relationship type and destination term, and defined the concept vector to be the bundling of these bound relationships. For the description words, we bundled the vectors representing each word together and bound this result with a new vector representing the relationship type “description,” as shown in Equation 1.

𝑥_{ICD concept} = ∑_{SNOMED CT} 𝑥_{rel} ⊛ 𝑥_{term} + ∑_{words} 𝑥_{desc} ⊛ 𝑥_{word}    (1)

Formally, let 𝔸 = {1, 2, ..., 𝑁_𝑎} be the set of integers enumerating the unique atomic symbols for SNOMED CT terms and description words. Let 𝔹 = {1, 2, ..., 𝑁_𝑟} be the set of integers enumerating the unique relationships for SNOMED CT terms, including the description relationship and the binding identity. Let 𝔻 = {1, 2, ..., 𝑁_𝑐} be the set of integers enumerating the ICD-9 and ICD-10 disease concepts represented by the VSA.

𝔸 has an associated embedding matrix 𝐴 ∈ ℝ^{𝑁_𝑎×𝑑}, where atomic vector 𝑎_𝑘 = 𝐴[𝑘,:], 𝑘 ∈ 𝔸, is the 𝑘-th row of the embedding matrix. Similarly, there is a relationship embedding matrix 𝑅 ∈ ℝ^{𝑁_𝑟×𝑑} with 𝑟_𝑗 = 𝑅[𝑗,:], 𝑗 ∈ 𝔹, and an ICD concept embedding matrix 𝐶 ∈ ℝ^{𝑁_𝑐×𝑑} with 𝑐_𝑖 = 𝐶[𝑖,:], 𝑖 ∈ 𝔻. We describe the VSA with the formula in Equation 2, where 𝒢_𝑖 is a graph representing the connections between ICD concept 𝑖 and atomic symbols 𝑘 by relationship 𝑗.

𝑐_𝑖 = ∑_{(𝑗,𝑘)∈𝒢_𝑖} 𝑟_𝑗 ⊛ 𝑎_𝑘    (2)

Additional details on how to efficiently use PyTorch autograd to learn through these HRR operations are provided in Appendix A.1.

2.3.3. Embedding Configurations

We call our method of constructing embeddings for ICD codes purely from HRR representations “HRRBase” and the standard method of creating transformer token embeddings from random vectors “unstructured”. While the HRRBase configuration enforces the ontology structure, we wondered whether it would be too rigid and have difficulty representing information not present in SNOMED CT. As dataset frequency information for ICD medical codes is not present in the HRR structure, we tried adding an embedding that represented the empirical frequency of each ICD code in the dataset. We also tried adding fully learnable embeddings with no prior structure.

Given the wide range of ICD code frequencies in MIMIC, we log-transformed the empirical ICD code frequencies and then discretized the resulting range. For our HRRFreq configuration, we used the sinusoidal frequency encoding as in [1] to encode the discretized log-frequency information. The frequency embeddings were normalized before being summed with the HRR embedding vectors.

We defined two additional configurations in which a standard embedding vector was integrated with the structured HRR concept vector. With “HRRAdd”, a learnable embedding was added to the concept embedding: HRRAdd = 𝐶 + 𝐿_add, 𝐿_add ∈ ℝ^{𝑁_𝑐×𝑑}. However, this roughly doubled the number of learnable parameters compared to other formulations.

With “HRRCat”, a learnable embedding of dimension 𝑑/2 was concatenated with an HRR concept embedding of dimension 𝑑/2. This keeps the total number of learnable parameters roughly the same as the unstructured configuration (25,493 𝑑-dimensional vectors) and the HRRBase configuration (22,725 𝑑-dimensional vectors). The final embedding matrix was defined as HRRCat = [𝐶 𝐿_cat], where 𝐶, 𝐿_cat ∈ ℝ^{𝑁_𝑐×𝑑/2}.

2.4. Experiments

We pre-trained the unstructured, HRRBase, HRRCat, and HRRAdd embedding configurations of HRRBERT on the masked language modelling (MLM) task, for 3 trials each. For each of the 3 pre-trained models, 10 fine-tuning trials were conducted, for a total of 30 trials per fine-tuning task. The best checkpoint from the 10 epochs of fine-tuning was saved based on validation performance. A test set containing 666 patient records was used to evaluate each of the fine-tuned models for both mortality and disease prediction. We report accuracy, precision, recall, and F1 scores averaged over the 30 trials for the fine-tuning tasks.
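Equation 2 can be computed entirely from differentiable PyTorch operations, so gradients reach the atomic and relationship matrices. The following is a minimal, hypothetical sketch with toy sizes and a made-up graph; it illustrates the idea, not the paper's Appendix A.1 implementation:

```python
import torch

torch.manual_seed(0)
d = 768                       # embedding dimension, as in the paper
N_a, N_r, N_c = 100, 8, 10    # toy sizes (the paper uses 22,725 / 40 / 25,493)

# learnable atomic (A) and relationship (R) embedding matrices, N(0, 0.02) init
A = torch.nn.Parameter(0.02 * torch.randn(N_a, d))
R = torch.nn.Parameter(0.02 * torch.randn(N_r, d))

# hypothetical ontology graph: concept i -> list of (relationship j, atom k)
graph = [[(0, 1), (1, 2)], [(0, 1), (2, 3)]] + \
        [[(i % N_r, i % N_a)] for i in range(N_c - 2)]

def circular_conv(x, y):
    # HRR binding via FFT; built from differentiable ops, so autograd applies
    return torch.fft.irfft(torch.fft.rfft(x) * torch.fft.rfft(y), n=x.shape[-1])

def concept_matrix():
    # c_i = sum over (j, k) in G_i of r_j (x) a_k   (Equation 2)
    rows = []
    for pairs in graph:
        j = torch.tensor([p[0] for p in pairs])
        k = torch.tensor([p[1] for p in pairs])
        rows.append(circular_conv(R[j], A[k]).sum(dim=0))
    return torch.stack(rows)  # (N_c, d) concept embedding matrix C

C = concept_matrix()
C.pow(2).mean().backward()    # dummy loss: gradients flow back to A and R
```

Because many ICD concepts share atoms, a gradient step driven by one concept also updates the shared atomic vectors used by other concepts, which is the property the ROOD experiment later exploits.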
3. Experimental Results

3.1. Pre-training

Figure 1: Pre-training validation set evaluation results for different configurations.

MLM accuracy is evaluated on a validation set over the course of pre-training. Pre-training results for the different configurations are shown in Figure 1. The pre-training results are averaged over 3 runs for each of the configurations except for HRRFreq, where only 1 model run was completed.

The baseline of learned unstructured embeddings has a peak pre-training validation performance of around 33.4%. HRRBase embeddings perform around 17% worse compared to this baseline. We hypothesize that the decrease in performance is due to a lack of embedded frequency information in HRRBase compared to learned unstructured embeddings. HRRFreq (which combines SNOMED CT information with frequency information) has a similar performance compared to unstructured embeddings, supporting this hypothesis. Compared to baseline, HRRAdd and HRRCat improve pre-training performance by a modest margin of around 2%. We posit that this almost 20% increase in performance of HRRCat and HRRAdd over HRRBase during pre-training is partly due to the fully learnable embeddings used in HRRCat and HRRAdd learning frequency information.

3.2. Fine-tuning

We fine-tuned the networks for mortality prediction and disease prediction. Across metrics and tasks, the best results were often seen with HRRBase (Table 1), with some being statistically significant.

3.2.1. Mortality Prediction Task

The mortality prediction task is defined as predicting patient mortality within 6 months after the last visit. Binary mortality labels were generated by comparing the time difference between the last visit and the mortality date. A training set of 13k patient records along with a validation set of 2k patient records were used to fine-tune each model on mortality prediction. Table 1 shows the evaluation results of mortality prediction for each of the configurations. We performed a two-sided Dunnett’s test to compare our multiple experimental HRR embedding configurations to the control unstructured embeddings, at a 𝑝 < 0.05 significance level. HRRBase embeddings had a significantly greater mean F1-score (𝑝 = 0.043) and precision (𝑝 = 0.042) compared to unstructured embeddings.

3.2.2. Disease Prediction Task

The disease prediction task is defined as predicting which disease chapters were recorded in the patient’s last visit, using information from earlier visits. We converted all ICD codes in a patient’s last visit into a multi-label binary vector of disease chapters. As there are 22 disease chapters defined in ICD-10, the multi-label binary vector has a size of 22, with binary values corresponding to the presence of a disease in each chapter. A training set of 4.5k patient records along with a validation set of 500 patient records were used to fine-tune each model on this task. Table 1 shows the evaluation results of disease prediction for each of the configurations. For the two-sided Dunnett’s test, Levene’s test shows that the equal-variance condition is satisfied, and the Shapiro-Wilk test suggests normal distributions except for HRRAdd accuracy. The test showed HRRBase embeddings had a significantly greater mean accuracy (𝑝 = 0.033) and precision (𝑝 = 0.023) compared to unstructured embeddings. No other comparisons of mean metrics for HRR embeddings were significantly greater than the control.

3.2.3. eICU Mortality Prediction

An additional experiment conducted on the Philips Electronic Intensive Care Unit (eICU) database [31] shows corroborating results with the MIMIC-IV experiments. For this experiment, we applied our mortality prediction models that were fine-tuned on MIMIC-IV to eICU data to see if our results generalize. Table 1 shows that HRRBase embeddings had a significantly greater mean accuracy (𝑝 = 0.046) compared to unstructured embeddings when applied to the eICU dataset. These models are not optimized for mortality prediction at other hospitals, where coding methodology and clinical practice may differ. For example, the most common code in the eICU dataset represents acute respiratory failure, whereas the most common code in the MIMIC-IV dataset represents hypertension.
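The multi-label target construction for the disease prediction task above can be sketched as follows. The code-to-chapter lookup here is a hypothetical toy mapping of a few codes, not the full ICD-10 chapter table (which is defined by code ranges):

```python
# Hypothetical code-to-chapter lookup; chapter indices chosen for illustration
CHAPTER_OF = {"I10": 8, "E11": 3, "J96": 9, "K21": 10}

def disease_label(last_visit_codes, n_chapters=22):
    """Multi-hot vector marking which disease chapters appear in the last visit."""
    label = [0] * n_chapters
    for code in last_visit_codes:
        label[CHAPTER_OF[code]] = 1  # repeats stay 1: multi-label, not counts
    return label

y = disease_label(["I10", "E11", "K21"])  # hypertension, diabetes, reflux
print(sum(y))  # 3 distinct chapters present
```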
Table 1
Fine-tuning mean test scores and standard deviations for mortality prediction, disease prediction, eICU mortality prediction,
and both Really-Out-Of-Distribution (ROOD) Unseen and Overall disease prediction tasks. The best scores are bolded and are
underlined if statistically significant.
Finetuning Task Configuration Accuracy Precision Recall F1-Score
ROOD HRRBase 94.9±1.0 83.5±4.6 76.8±5.1 79.5±4.9
Unseen Unstructured 92.3±0.3 46.2±0.0 50.0±0.1 48.0±0.1
ROOD HRRBase 81.9±0.1 78.3±0.3 75.2±0.8 76.4±0.5
Overall Unstructured 81.9±0.2 78.7±0.7 74.4±1.2 76.0±0.8
HRRBase 84.4±2.3 65.8±2.0 85.6±2.2 69.2±2.7
Mortality HRRAdd 84.0±2.2 65.7±1.9 85.7±2.3 68.9±2.5
Prediction HRRCat 83.9±2.3 65.6±1.7 84.9±2.8 68.8±2.5
Unstructured 83.4±1.9 64.9±1.2 84.6±2.2 67.9±1.8
HRRBase 79.9±0.5 73.0±1.2 67.2±0.7 69.0±0.6
Disease HRRAdd 79.6±0.7 72.6±1.4 67.3±0.9 69.0±0.6
Prediction HRRCat 79.6±0.8 72.5±1.7 67.3±1.0 68.9±0.8
Unstructured 79.4±0.5 72.1±1.1 67.8±1.0 69.2±0.7
eICU HRRBase 68.9±1.3 75.0±1.8 57.0±5.8 64.5±3.5
Mortality HRRAdd 68.1±1.6 74.0±2.2 56.2±6.8 63.6±3.9
Prediction HRRCat 68.2±1.2 73.8±2.6 57.0±7.2 64.0±3.7
3.2.4. Really-Out-Of-Distribution (ROOD) Disease on other codes. Unstructured embeddings cannot learn
Prediction better representations for codes never seen in training.
We conducted an additional disease-prediction experi-
ment to test generalization to patients with codes outside 3.3. t-SNE of Frequency Bias
the training distribution. We found six patients with
records that consisted of only 32 codes between them
(see list of codes in Appendix A). We created a really-
out-of-distribution (ROOD) dataset that consisted of all
patients in MIMIC-IV (nearly 30K) with at least one of
these codes. We used this as a validation set. The sepa-
rate pre-training and fine-tuning dataset did not contain Figure 2: Comparing t-SNE of (a) unstructured embeddings,
these codes. We also created a smaller validation dataset (b) HRRAdd, (c) HRRCat, and (d) HRRBase. The t-SNE graphs
consisting of the six patients with only these codes. Dur- are color-coded by the frequency of the ICD codes in the
ing pretraining, the HRRBase and unstructured models dataset - highly frequent codes are colored blue while infre-
did not encounter any examples using the 32 ROOD codes quent codes are colored red.
and so did not explicitly learn representations for those
codes. The trained models were then tested using the We computed t-SNE dimension reductions to visual-
ROOD dataset. ize relationships among ICD code embeddings in the
Results from Table 1 on ROOD dataset disease predic- pre-trained models. Figure 2 shows that unstructured
tion show that HRRBase outperforms the unstructured embeddings of common ICD codes are clustered together
embedding model for contexts of entirely unseen codes. with a large separation from those of uncommon codes.
We assess statistical significance using two-tailed, in- This suggests that code-frequency information is promi-
dependent t-test with unequal variance, as some mea- nently represented in these embeddings, consistent with
surements failed Levene’s test for equal variance. The frequency bias in related models [6]. Common and un-
means of all the metrics for HRRBase are significantly common code clusters are less distinct in HRRBase, which
greater than for unstructured when making inferences does not explicitly encode frequency information.
on patients with entirely unseen codes, 𝑝 < 0.001 for all As shown in Figure 1, adding code-frequency infor-
metrics. Given the embedded ontological structure, we mation to the structured HRRBase embeddings, i.e. the
hypothesize that HRRBase implicitly learns useful em- HRRFreq embeddings, improved the pre-training loss be
beddings for the 32 unseen ROOD codes by learning any similar to unstructured embeddings. This suggests that
shared embedding components of the VSA when training unstructured components in HRRAdd and HRRCat may
Figure 3: t-SNE representation of sinusoidal frequency em-
beddings (left), and unstructured embedding components of
HRRAdd (middle) and HRRCat (right).
have learned some frequency information, since these
losses are also similar to the loss of models with Unstruc-
tured embeddings. To investigate whether this occurred,
we performed t-SNE dimension reductions of the unstruc-
tured components of HRRAdd and HRRCat and colored
the points by code frequency, shown in Figure 3. This Figure 4: The top-10 MLM accuracy for binned code frequen-
cies in log scale. Common codes are in frequency bin 0 with
graph suggests that these additional unstructured em-
rarest codes being in frequency bin -12. 0.05, 0.01, and 0.001
beddings learn some frequency information, due to clus- significance levels comparing to unstructured embeddings are
tering of high frequency codes. However, the frequency indicated with 1, 2, and 3 asterisks respectively. Note that
information learned by HRRCat and HRRAdd learnable HRRBase is expected to perform poorly in this test due to lack
embeddings influence overall embeddings less strongly of code-frequency information.
in comparison to unstructured embeddings as seen in Fig-
ure 2, where low frequency embeddings are less distinctly
separated from higher frequency embeddings.
3.4. Top-k Accuracy for MLM

Accurately predicting infrequently used disease codes is an important, clinically relevant task. Because the model sees common codes far more often than rare codes during training, rare codes are naturally challenging to predict. Given promising empirical results on out-of-distribution mortality prediction for eICU and disease prediction on ROOD, we hypothesized that our HRR embedding models should be more accurate than unstructured embedding models when predicting rare codes in the dataset, since rare codes should share some atomic vectors in their representations with common codes.

To test this, we evaluated the accuracy of an MLM pre-trained model predicting a single masked code of known frequency. We split the codes in the pre-training validation dataset into 7 bins from log frequency -14 to 0, such that each bin has a width of 2. The most common codes are in a bin with log frequencies between -2 and 0, while the rarest codes are in a bin with log frequencies between -14 and -12. From each bin, we selected 400 codes at random, repeating codes from that bin if there were fewer than 400. For each of these codes, we selected one patient that had that code in their history, masked that code as would be done in MLM, and created a dataset of these 2,800 patients to use for MLM inference.

Figure 5: The top-100 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. 0.05, 0.01, and 0.001 significance levels comparing to unstructured embeddings are indicated with 1, 2, and 3 asterisks respectively.

Figure 4 and Figure 5, respectively, show the MLM top-10 and top-100 accuracy on predicting codes in the different frequency bins, averaged across the three pre-training models per configuration. Significant comparisons to the unstructured control at a 𝑝 < 0.05 level are indicated with an asterisk. We assess statistical significance for each bin using a two-tailed Dunnett's test comparing mean accuracy scores of the experimental HRR configurations against the control unstructured configuration. Notably, the top-100 accuracy in frequency bin -12 is non-zero for the HRR methods. These codes in the rarest bin occur only
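The bin-and-sample protocol of Section 3.4 (seven log-frequency bins of width 2, 400 codes sampled per bin with repetition when a bin is small) can be sketched as follows. This is a minimal illustration: the code counts are synthetic, and the base-2 logarithm of relative frequency is an assumption, as the paper does not state the log base.

```python
import math
import random

random.seed(0)

# Illustrative synthetic corpus: code -> occurrence count (not the study's data).
code_counts = {f"code_{i}": random.randint(1, 100_000) for i in range(5_000)}
total = sum(code_counts.values())

def bin_index(count: int) -> int:
    """Map a code's relative frequency to one of 7 bins over log2 freq in [-14, 0]."""
    log_freq = math.log2(count / total)
    log_freq = max(-14.0, min(log_freq, -1e-9))   # clamp into [-14, 0)
    return int((log_freq + 14) // 2)              # 0 = rarest bin, 6 = most common

bins = {b: [] for b in range(7)}
for code, count in code_counts.items():
    bins[bin_index(count)].append(code)

# Sample 400 codes per non-empty bin, repeating codes when a bin has fewer than 400.
samples = {b: [random.choice(codes) for _ in range(400)]
           for b, codes in bins.items() if codes}

for b in sorted(samples):
    print(f"bin {b}: {len(set(samples[b]))} unique of {len(samples[b])} sampled")
```

In the study, each sampled code is then masked in one patient's history, yielding the 2,800-patient MLM inference set.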
Table 2
Three cosine similarity case studies looking at related ICD codes for unstructured and HRRBase embeddings. The top 4 cosine-similar ICD codes to the chosen code are listed (most to least similar) with their full description and similarity value.

2724-9 - Other and unspecified hyperlipidemia
  Unstructured                              |  HRRBase
  Pure hypercholesterolemia          0.542  |  Other hyperlipidemia                         1.000
  Hyperlipidemia, unspecified        0.482  |  Hyperlipidemia, unspecified                  1.000
  Esophageal reflux                  0.304  |  Pure hypercholesterolemia                    0.463
  Anemia, unspecified                0.279  |  Mixed hyperlipidemia                         0.418

9916-9 - Hypothermia
  Unstructured                              |  HRRBase
  Frostbite of hand                  0.418  |  Hypothermia, initial encounter               0.794
  Frostbite of foot                  0.361  |  Hypothermia not with low env. temp.          0.592
  Drowning and nonfatal submersion   0.352  |  Effect of reduced temp., initial encounter   0.590
  Immersion foot                     0.341  |  Other specified effects of reduced temp.     0.590

K219-10 - Gastro-esophageal reflux disease without esophagitis
  Unstructured                              |  HRRBase
  Esophageal reflux                  0.565  |  Esophageal reflux                            0.635
  Hyperlipidemia, unspecified        0.335  |  Gastro-eso. reflux d. with esophagitis       0.512
  Anxiety disorder, unspecified      0.332  |  Reflux esophagitis                           0.512
  Essential (primary) hypertension   0.326  |  Hypothyroidism, unspecified                  0.268
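Nearest-code lists like those in Table 2 are top-k cosine-similarity queries against the embedding matrix. A minimal numpy sketch of such a query follows; the embedding matrix here is a random placeholder, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
codes = [f"code_{i}" for i in range(1_000)]
E = rng.standard_normal((len(codes), 64))   # placeholder embedding matrix

def top_k_similar(query: str, k: int = 4):
    """Return the k most cosine-similar codes to `query`, excluding itself."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # L2-normalize rows
    sims = En @ En[codes.index(query)]                  # cosine similarity to all codes
    order = np.argsort(-sims)                           # most to least similar
    return [(codes[i], float(sims[i])) for i in order if codes[i] != query][:k]

for code, sim in top_k_similar("code_0"):
    print(f"{code}  {sim:.3f}")
```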
once in the dataset and therefore have never been used by the model for gradient updates, since they are in the validation dataset. This suggests that the HRR methods have some ability to provide clinically relevant information about rare codes. However, accuracy with the rarest codes remains too low to be of practical value, perhaps due to limited overlap of these codes' atomic vectors with those of more common codes.

3.5. Medical Code Case Study

Table 2 shows case studies for codes Other and unspecified hyperlipidemia (2724-9), Hypothermia (9916-9), and Gastro-esophageal reflux disease without esophagitis (K219-10). In the first case study, for 2724-9, we observe that highly ontologically similar codes, such as Other hyperlipidemia and Hyperlipidemia, unspecified, are encoded with high cosine similarity by HRRBase, which is not the case for unstructured embeddings. The co-occurrence problem can be seen in the second case study, for 9916-9. The most similar codes for HRRBase are medically similar codes that would not usually co-occur, while for unstructured embeddings the most similar codes co-occur frequently. For the final case study, on K219-10, frequency-related bias can be observed in the unstructured embeddings, where frequent but mostly ontologically unrelated codes appear in the top list of cosine-similar codes, whereas the top list for HRRBase contains medically similar codes.

We broadened this case study to test statistical differences in cosine and semantic embedding similarity between structured and unstructured embeddings. 30 ICD codes were selected from different frequency categories in the dataset: 10 codes drawn randomly from the 300 most common codes, 10 codes drawn randomly by weighted frequency from codes appearing fewer than 30 times in the dataset, and 10 codes randomly selected by weighted frequency from the entire dataset. For each selected code, the top 4 cosine-similar ICD codes were assessed by a physician for ontological similarity.

For each frequency category, a one-tailed Fisher's exact test was conducted to determine whether a relationship existed between embedding type and clinical relatedness. We found that the results for the rare codes were statistically significant, with 𝑝 = 2.44×10⁻⁸. With 10 rare codes and the top 4 cosine-similar ICD codes selected for each rare code, there are 40 top cosine-similar codes in total. For the unstructured embeddings, only 4 of the top 40 cosine-similar codes were deemed strongly ontologically related by our physician, with the remaining codes deemed less related or unrelated. For our structured HRRBase embeddings, 28 of the top 40 cosine-similar codes were deemed strongly ontologically related, with the remaining codes deemed less related or unrelated. This suggests that knowledge-integrated structured embeddings are associated with greater clinical relevance of the top cosine-similar codes than unstructured embeddings for rare codes, where little training data exists.
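The rare-code comparison can be reproduced with a one-tailed Fisher's exact test, computed here as a hypergeometric upper tail using only the standard library. It assumes the 2×2 table contrasts strongly related codes against less related/unrelated codes (28 of 40 for HRRBase, 4 of 40 for unstructured):

```python
from math import comb

# Of the 40 top cosine-similar codes per embedding type, the physician rated
# 28 (HRRBase) vs. 4 (unstructured) as strongly ontologically related.
related_hrr, related_unstr, per_group = 28, 4, 40
K = related_hrr + related_unstr   # 32 strongly related codes overall
N = 2 * per_group                 # 80 assessed codes overall

# One-tailed Fisher's exact test = hypergeometric tail P(X >= 28), where X
# counts how many strongly related codes land in the HRRBase group.
p = sum(comb(K, x) * comb(N - K, per_group - x)
        for x in range(related_hrr, min(K, per_group) + 1)) / comb(N, per_group)
print(f"p = {p:.3g}")  # → p = 2.44e-08, matching the reported value
```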
4. Discussion

Transformers have leading performance in many applications, but their internal processes are opaque, emerging from enormous parameter sets and data volumes beyond human experience. It is hard to know when they can be trusted; for example, generative transformers are prone to subtle confabulations. Transformers have a general-purpose architecture that performs as well in vision and other modalities as in language. They are the culmination of a key trend in artificial intelligence, away from problem-specific engineering and toward massive data and computation. This trend is justified in terms of performance. However, given two models with equal performance, the one with more explicit conceptual structure is preferable in terms of trust and explainability.

The work presented here is a step in this direction, with our HRRBase embeddings that have explicit conceptual structure and perform equivalently to or better than typical transformer embeddings. The benefit of structured embeddings becomes more pronounced for tasks that involve codes that are rare or not present in the training data. HRR embeddings can also be relied on to represent medical meaning rather than co-occurrence in the training data. They also untangle the representation of code frequency, so that it can be included or not, and its effects on decisions understood. Importantly, despite this additional structure, the embeddings are thoroughly learned, suggesting that the approach will be consistent with high performance beyond the examples we have studied.

As our method scales with and leverages PyTorch autograd in the construction of the vector-symbolic embeddings, it is compatible with existing medical LLM architectures as an embedding component capable of encoding domain knowledge.

Future work could explore the potential of these structured embeddings for explaining and controlling the observed frequency bias. As HRRs can be queried with linear operations, future work could also explore whether transformers can learn to extract specific information from these composite embeddings. Limitations to address in future work include the complexity of processing knowledge graphs to be compatible with HRRs. Another important limitation is that our method relies on rare-code HRRs sharing atomic elements with common-code HRRs; in SNOMED CT, however, rare codes are likely to contain some rare atomic elements. To address this point, in addition to SNOMED CT, knowledge could be encoded from sources such as pre-trained medical embeddings, different medical ontologies, and other medical domain knowledge to further improve our proposed methodology. In LLMs that process both medical codes and text, it would make sense to share word embeddings between modalities. This would allow the training of each modality to benefit from the training of the other, and may help to align the representations of codes and text.

5. Conclusion

We proposed a novel hybrid neural-symbolic approach called HRRBERT that integrates medical ontologies represented by HRR embeddings. In tests with the MIMIC-IV dataset, HRRBERT models modestly outperformed baseline models with unstructured embeddings for pre-training, disease prediction accuracy, mortality prediction F1, and fine-tuning tasks involving infrequently seen codes. HRRBERT models had pronounced performance advantages in MLM with rare codes and disease prediction for patients with no codes seen during training (ROOD - Unseen in Table 1). We also showed that HRRs can be used to create medical code embeddings that better respect ontological similarities for rare codes. A key benefit of our approach is that it facilitates explainability by disentangling token-frequency information, which is prominently represented but implicit in unstructured embeddings.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[2] L. Rasmy, Y. Xiang, Z. Xie, C. Tao, Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, npj Digital Medicine 4 (2021) 86. doi:10.1038/s41746-021-00455-y.
[3] A. Ganesan, H. Gao, S. Gandhi, E. Raff, T. Oates, J. Holt, M. McLean, Learning with holographic reduced representations, CoRR abs/2109.02157 (2021). URL: https://arxiv.org/abs/2109.02157. arXiv:2109.02157.
[4] M. K. Sarker, L. Zhou, A. Eberhart, P. Hitzler, Neuro-symbolic artificial intelligence, AI Communications 34 (2021) 197–209.
[5] S. Ramgopal, L. N. Sanchez-Pinto, C. M. Horvat, M. S. Carroll, Y. Luo, T. A. Florin, Artificial intelligence-based clinical decision support in pediatrics, Pediatric Research 93 (2023) 334–341.
[6] T. Yu, T. Tuinstra, B. Hu, R. Rezai, T. Fortin, R. DiMaio, B. Vartian, B. Tripp, Frequency bias in MLM-trained BERT embeddings for medical codes, CMBES Proceedings 45 (2023). URL: https://proceedings.cmbes.ca/index.php/proceedings/article/view/1050.
[7] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al., Pythia: A suite for analyzing large language models across
training and scaling, in: International Conference on Machine Learning, PMLR, 2023, pp. 2397–2430.
[8] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.
[9] T. Plate, Holographic reduced representations, IEEE Transactions on Neural Networks 6 (1995) 623–641. doi:10.1109/72.377968.
[10] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, R. Mark, MIMIC-IV (version 2.0), 2022. URL: https://doi.org/10.13026/7vcr-e114. doi:10.13026/7vcr-e114.
[11] P. Kanerva, Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors, Cognitive Computation 1 (2009) 139–159. URL: https://api.semanticscholar.org/CorpusID:733980.
[12] R. W. Gayler, Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience, 2004. arXiv:cs/0412059.
[13] P. Neubert, S. Schubert, Hyperdimensional computing as a framework for systematic aggregation of image descriptors (2021). URL: http://arxiv.org/abs/2101.07720. doi:10.48550/arXiv.2101.07720. arXiv:2101.07720 [cs].
[14] P. Neubert, S. Schubert, K. Schlegel, P. Protzel, Vector semantic representations as descriptors for visual place recognition, in: Robotics: Science and Systems XVII, Robotics: Science and Systems Foundation, 2021. URL: http://www.roboticsproceedings.org/rss17/p083.pdf. doi:10.15607/RSS.2021.XVII.083.
[15] A. Rahimi, P. Kanerva, L. Benini, J. M. Rabaey, Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals, Proceedings of the IEEE 107 (2019) 123–143. doi:10.1109/JPROC.2018.2871163.
[16] K. Schlegel, P. Neubert, P. Protzel, HDC-MiniROCKET: Explicit time encoding in time series classification with hyperdimensional computing (2022). URL: http://arxiv.org/abs/2202.08055. doi:10.48550/arXiv.2202.08055. arXiv:2202.08055 [cs].
[17] P. Smolensky, Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159–216. URL: https://www.sciencedirect.com/science/article/pii/000437029090007M. doi:10.1016/0004-3702(90)90007-M.
[18] M. M. Alam, E. Raff, S. Biderman, T. Oates, J. Holt, Recasting self-attention with holographic reduced representations, 2023. arXiv:2305.19534.
[19] J. Kim, H. Lee, M. Imani, Y. Kim, Efficient hyperdimensional learning with trainable, quantizable, and holistic data representation, in: 2023 Design, Automation Test in Europe Conference Exhibition (DATE), 2023, pp. 1–6. doi:10.23919/DATE56975.2023.10137134.
[20] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs (2015). URL: http://arxiv.org/abs/1510.04935. doi:10.48550/arXiv.1510.04935. arXiv:1510.04935 [cs, stat].
[21] T. Dash, A. Srinivasan, L. Vig, Incorporating symbolic domain knowledge into graph neural networks, Machine Learning 110 (2021) 1609–1636. URL: https://doi.org/10.1007%2Fs10994-021-05966-z. doi:10.1007/s10994-021-05966-z.
[22] M. Kulmanov, F. Z. Smaili, X. Gao, R. Hoehndorf, Semantic similarity and machine learning with ontologies, Briefings in Bioinformatics 22 (2020) bbaa199. URL: https://doi.org/10.1093/bib/bbaa199. doi:10.1093/bib/bbaa199.
[23] V. Riikka, V. Anne, P. Sari, Systematized nomenclature of medicine-clinical terminology (SNOMED CT) clinical use cases in the context of electronic health record systems: Systematic literature review, JMIR Med Inform (2023). doi:10.2196/43750.
[24] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. M. Fung, J. Poon, Medical concept embedding with multiple ontological representations, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 4613–4619. URL: https://doi.org/10.24963/ijcai.2019/641. doi:10.24963/ijcai.2019/641.
[25] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, H. E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals, Circulation 101 (2000) e215–e220. doi:10.1161/01.CIR.101.23.e215.
[26] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[27] NLM, SNOMED CT to ICD-10-CM map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
[28] OHDSI, OHDSI standardized vocabularies, 2019. URL: https://github.com/OHDSI/Vocabulary-v5.0/wiki.
[29] NLM, ICD-9-CM diagnostic codes to SNOMED CT map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
[30] NCHS, Diagnosis code set general equivalence mappings, 2018. URL: https://ftp.cdc.gov/pub/health_statistics/nchs/Publications/ICD10CM/2018/Dxgem_guide_2018.pdf.
[31] T. J. Pollard, A. E. W. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, O. Badawi, The eICU Collaborative Research Database, a freely available multi-center database for critical care research, Scientific Data 5 (2018) 1–13.

A. List of 32 ROOD Codes

The following is the list of 32 ROOD codes:

1. G248-10: Other dystonia
2. E8498-9: Accidents occurring in other specified places
3. E9688-9: Assault by other specified means
4. Z681-10: Body mass index (BMI) 19.9 or less, adult
5. 30550-9: Opioid abuse, unspecified
6. R262-10: Difficulty in walking, not elsewhere classified
7. E887-9: Fracture, cause unspecified
8. R471-10: Dysarthria and anarthria
9. 9916-9: Hypothermia
10. E9010-9: Accident due to excessive cold due to weather conditions
11. F10129-10: Alcohol abuse with intoxication, unspecified
12. E8499-9: Accidents occurring in unspecified place
13. R636-10: Underweight
14. 920-9: Contusion of face, scalp, and neck except eye(s)
15. R4182-10: Altered mental status, unspecified
16. 95901-9: Head injury, unspecified
17. 78097-9: Altered mental status
18. F29-10: Unspecified psychosis not due to a substance or known physiological condition
19. Z880-10: Allergy status to penicillin
20. Z818-10: Family history of other mental and behavioral disorders
21. 81600-9: Closed fracture of phalanx or phalanges of hand, unspecified
22. 87341-9: Open wound of cheek, without mention of complication
23. H9222-10: Otorrhagia, left ear
24. Z978-10: Presence of other specified devices
25. G20-10: Parkinson's disease
26. G249-10: Dystonia, unspecified
27. 9100-9: Abrasion or friction burn of face, neck, and scalp except eye, without mention of infection
28. 78906-9: Abdominal pain, epigastric
29. E8889-9: Unspecified fall
30. 30500-9: Alcohol abuse, unspecified
31. G520-10: Disorders of olfactory nerve
32. 8020-9: Closed fracture of nasal bones

A.1. Learning through HRR Operations Efficiently

To make the HRR concept embeddings useful for a deep neural network, the operations used to form the embeddings need to be compatible with backpropagation so that gradient descent can update the lower-level atomic vectors. We desired a function that produced the ICD concept embedding matrix, 𝐶, given the inputs of the VSA knowledge graphs, 𝒢𝑖, and symbol embedding matrices, 𝑅 and 𝐴.

We attempted three approaches to computing 𝐶 through VSA operations. First, we naively tried to compute each concept vector in 𝐶 one at a time. However, this approach was too slow in both the forward and backward pass, requiring more than 1 second for each pass. Our second approach used slices of 𝐺 along the relationship dimension as a sparse binary matrix, which, when multiplied with 𝐴, would perform the indexing and summing of atomic vectors for each concept. This result can be convolved with the relationship vector and added to the concept embedding matrix. This approach was much faster and used a moderate amount of memory for one of our less complex VSA formulations. However, on our most complex formulation, it used ∼15 GB of memory.

Our final approach took advantage of the fact that many disease concepts use the same relationship, but with different atomic symbols. Also, the number of times a concept uses a particular relationship is relatively low, except for the SNOMED "isA" relationship and our defined "description" relationship. Thus, for a particular relationship, we can contribute to building many disease concept vectors at once by selecting many atomic vectors, doing a vectorized convolution with the relationship vector, and distributing the results to be added to the appropriate concept embedding rows. This step needs to be repeated at most 𝑚 times for a particular relationship, where 𝑚 is the maximum multiplicity of that relationship among all concepts. We improved memory efficiency by performing fast Fourier transforms (FFTs) on the atomic vector embeddings and constructing the concept vectors by performing binding via element-wise multiplication in the Fourier domain. Due to the linearity of the HRR operations, we performed a single final inverse FFT on the complex-valued concept embedding to convert back to the real domain.
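The vectorized FFT-domain binding step can be sketched in numpy as follows (the paper's implementation uses PyTorch so that autograd flows through these operations; the dimensions, the `atom_of` index array, and the single-relationship setup here are illustrative assumptions). HRR binding is circular convolution, which becomes element-wise multiplication of spectra, so one gather plus one inverse FFT updates many concept rows at once:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # embedding dimension (illustrative)
n_concepts, n_atoms = 100, 50

A = rng.standard_normal((n_atoms, d)) / np.sqrt(d)   # atomic symbol vectors
r = rng.standard_normal(d) / np.sqrt(d)              # one relationship vector

# For one relationship "slot": concept i binds r with atom atom_of[i].
atom_of = rng.integers(0, n_atoms, size=n_concepts)

# Vectorized binding: gather atom spectra, multiply by the relationship
# spectrum, and take a single inverse FFT back to the real domain.
R = np.fft.rfft(r)
C = np.fft.irfft(np.fft.rfft(A[atom_of], axis=-1) * R, n=d, axis=-1)

# Sanity check against the direct definition of circular convolution
# for the first concept: (a ⊛ r)[i] = sum_k a[k] * r[(i - k) mod d].
a = A[atom_of[0]]
direct = np.zeros(d)
for k in range(d):
    direct += a[k] * np.roll(r, k)
assert np.allclose(C[0], direct, atol=1e-8)

# Unbinding query (circular correlation, i.e. multiply by the conjugate
# spectrum) approximately recovers the bound atom.
decoded = np.fft.irfft(np.fft.rfft(C[0]) * np.conj(R), n=d)
recovered = int(np.argmax(A @ decoded))
print("recovered atom matches:", recovered == int(atom_of[0]))
```

With multiple relationships, this step repeats per relationship (and per multiplicity slot, up to 𝑚 times), accumulating into 𝐶.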
The final approach is much faster than the first approach, since it takes advantage of vectorized operations to contribute to many concept vectors at once. It is also more memory efficient than the second approach, since all the intermediate results are dense, so allocations are not wasted on creating mostly sparse results. On our most complex formulation, this approach uses ∼3.5 GB of memory and takes ∼80 ms and ∼550 ms for the forward and backward pass, respectively.