<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM KDD Conference, August</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Representations for Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bing Hu</string-name>
          <email>bingxu.hu@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trevor Yu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tia Tuinstra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Rezai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshit Bokadia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachel DiMaio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Fortin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Vartian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bryan Tripp</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>McMaster University</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <issue>2024</issue>
      <abstract>
<p>Transformer models trained on NLP tasks with medical codes often have randomly initialized embeddings that are then adjusted based on training data. For terms appearing infrequently in the dataset, there is little opportunity to improve these representations and learn semantic similarity with other concepts. Medical ontologies represent many biomedical concepts and define a relationship structure between these concepts, making ontologies a valuable source of domain-specific information. Holographic Reduced Representations (HRR) are capable of encoding ontological structure by composing atomic vectors to create structured higher-level concept vectors. We developed an embedding layer that generates concept vectors for clinical diagnostic codes by applying HRR operations that compose atomic vectors based on the SNOMED CT ontology. This approach allows for learning the atomic vectors while maintaining structure in the concept vectors. We trained a Bidirectional Encoder Representations from Transformers (BERT) model to process sequences of clinical diagnostic codes and used the resulting HRR concept vectors as the embedding matrix for the model. The HRR-based approach introduced interpretable structure into code embeddings while maintaining or modestly improving performance on the masked language modeling (MLM) pre-training task (particularly for rare codes) as well as the fine-tuning tasks of mortality and disease prediction. This approach also better maintains semantic similarity between medically related concept vectors, due to both shared atomic vectors and disentangling of code-frequency information.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Ontology</kwd>
        <kwd>Knowledge-Integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Transformers [1] jointly optimize high-dimensional vector embeddings that represent input tokens, and a network that contextualizes and transforms these embeddings to perform a task. Originally designed for natural language processing (NLP) tasks, transformers are now widely used with other data modalities. In medical applications, one important modality consists of medical codes that are extensively used in electronic health records (EHR). A prominent example in this space is Med-BERT [2], which consumes a sequence of diagnosis codes. Tasks that Med-BERT and other EHR-transformers perform include disease and mortality prediction.</p>
      <p>Deep networks have traditionally been alternatives to symbolic artificial intelligence, with different advantages [<xref ref-type="bibr" rid="ref3">3</xref>]. Deep networks use real-world data effectively, but symbolic approaches have competitive properties, such as better transparency and capacity for incorporating structured information, inspiring many efforts to combine the two approaches in neuro-symbolic systems [<xref ref-type="bibr" rid="ref4">4</xref>]. This additional transparency and ability to incorporate structured information are potential benefits of symbolic approaches in medical applications [<xref ref-type="bibr" rid="ref5">5</xref>]. Standard large language models (LLMs) can be prone to biases in the training data, such as frequency bias, which can result in medical misinformation and potentially clinical harm [<xref ref-type="bibr" rid="ref6 ref7">6, 7, 8</xref>].</p>
      <p>Here we use a novel neuro-symbolic medical transformer architecture incorporating structured knowledge, which allows the architecture to optimize the embeddings of atomic concepts. We test our method, Holographic Reduced Representation Bi-directional Encoder Representations from Transformers (HRRBERT), on the Medical Information Mart for Intensive Care (MIMIC)-IV dataset [10] and show improvements in both pre-training and fine-tuning tasks. We also show that our embeddings of ontologically similar rare medical codes have high cosine similarity, in contrast with embeddings that are learned in the standard way. Finally, we investigate learned representations of medical-code frequency, in light of a recent demonstration of frequency bias in EHR-transformers [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>We contribute:</p>
      <p>• A novel neuro-symbolic architecture, HRRBERT, that combines vector-symbolic embeddings with the BERT LLM architecture, leading to better performance in medical tasks.</p>
      <p>• Efficient construction of vector-symbolic embeddings that leverages PyTorch autograd on GPUs.</p>
      <p>• Optimized medical-code embeddings that better respect the semantic similarity of medical terminology than standard embeddings for infrequently used codes.</p>
      <p>We focus here on processing medical codes, but our methods would extend naturally to foundation models that combine medical codes and natural language. Specifically, the trained atomic vectors of our vector-symbolic embeddings could share a dictionary with language embeddings, so that training of each could improve the representation of the other.</p>
      <sec id="sec-2-4">
        <title>1.1. Background and Related Works</title>
        <p>The Vector-Symbolic Architectures (VSA) approach is a computing paradigm that relies on high dimensionality and randomness to represent concepts as unique vectors in a high-dimensional space [11]. VSAs create and manipulate distributed representations of concepts by combining base vectors with bundling, binding, and permutation algebraic operators [12]. For example, a scene with a red box and a green ball could be described with the vector SCENE = RED ⊗ BOX + GREEN ⊗ BALL, where ⊗ indicates binding and + indicates bundling. The atomic concepts of RED, GREEN, BOX, and BALL are represented by base vectors, which are typically random. VSAs also define an inverse operation that allows the decomposition of a composite representation. For example, the scene representation could be queried as SCENE ⊗ BOX⁻¹. This should return the representation of RED, or an approximation of RED that is identifiable when compared to a dictionary. In a VSA, the similarity between concepts can be assessed by measuring the distance between the two corresponding vectors.</p>
        <p>VSAs were proposed to address challenges in modelling cognition, particularly language [12]. However, VSAs have been successfully applied across a variety of domains and modalities outside of the area of language as well, including in vision [13, 14], biosignal processing [15], and time-series classification [16]. Regardless of the modality or application, VSAs provide value by enriching vectors with additional information, such as spatial semantic information in images and global time encoding in time series.</p>
        <p>An early VSA framework was Smolensky’s Tensor Product Representation [17], which addressed the need for compositionality but suffered from exploding model dimensionality. The VSA framework introduced by Plate, Holographic Reduced Representations (HRR), improved upon Smolensky’s by using circular convolution as the binding operator [9]. Circular convolution keeps the output in the same dimension, solving the problem of exploding dimensionality.</p>
        <p>In the field of deep learning, HRRs have been used in previous work to recast self-attention for transformer models [18], to improve the efficiency of neural networks performing a multi-label classification task by using an HRR-based output layer [<xref ref-type="bibr" rid="ref3">3</xref>], and as a learning model itself with a dynamic encoder that is updated through training [19]. In all of these works, the efficiency and simple arithmetic of HRRs are leveraged. Our work differs in that we also leverage the ability of HRRs to create structured vectors that represent complex concepts as inputs to a transformer model.</p>
        <p>VSAs such as HRRs can effectively encode domain knowledge, including complex concepts and the relationships between them. For instance, Nickel et al. [20] propose holographic embeddings that make use of VSA properties to learn and represent knowledge graphs. Encoding domain knowledge is of interest in the field of deep learning, as it could improve, for example, a deep neural network’s ability to leverage human knowledge and to communicate its results within a framework that humans understand [21]. Ontologies are a form of domain knowledge incorporated into machine learning models to use background knowledge to create embeddings with meaningful similarity metrics, and for other purposes [22]. In our work, we use HRRs to encode domain knowledge in trainable embeddings for a transformer model. The domain knowledge we use comes from the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), which is a widely used clinical ontology system that includes definitions of relationships between clinical concepts [23].</p>
        <p>To the best of our knowledge, HRRs have not been used before as embeddings for transformer models. Transformer models typically use learned embeddings with random initializations [<xref ref-type="bibr" rid="ref1">1</xref>]. However, in the context of representing ontological concepts, using such unstructured embeddings can have undesirable effects. One problem is the inconsistency between the rate of co-occurrence, or patterns of occurrence, of medical concepts and their degree of semantic similarity described by the ontology. For example, the concepts of “Type I Diabetes” and “Type II Diabetes” are mutually exclusive in EHR data and do not follow the same patterns of occurrence, due to differences in pathology and patient populations [24]. The differences in occurrence make it difficult for a transformer model to learn embeddings with accurate similarity metrics. The concepts should have relatively high similarity according to the ontology: they share a common ancestor of “Diabetes Mellitus,” they are both metabolic disorders that affect blood glucose levels, and they can both lead to similar health outcomes. Song et al. [24] seek to address this type of inconsistency by training multiple “multi-sense” embeddings for each non-leaf node in an ontology’s knowledge graph via an attention mechanism. However, the “multi-sense” embeddings do not address the learned frequency-related bias that also arises from the co-occurrence of concepts. Frequency-related bias raises an explainability issue, as it leads to learned embeddings that do not reflect true similarity relationships between concepts (for example, as defined in an ontology) but instead reflect the frequency of the concepts in the dataset [<xref ref-type="bibr" rid="ref6">6</xref>]. This bias particularly affects codes that are used less frequently.</p>
        <p>Our proposed approach, HRRBERT, uses the structure from SNOMED CT to represent thousands of concepts with high-dimensional vectors, such that each vector reflects a particular clinical meaning and can be compared to other vectors using the HRR similarity metric, cosine similarity. It also leverages the computing properties of HRRs to provide structured embeddings for an LLM that supports optimization through backpropagation.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Methods</title>
      <sec id="sec-3-1">
        <title>2.1. MIMIC-IV Dataset</title>
        <sec id="sec-3-1-1">
          <p>The data used in this study was derived from the Medical Information Mart for Intensive Care (MIMIC) v2.0
database, which is composed of de-identified EHRs from
in-patient hospital visits between 2008 and 2019 [10].
MIMIC-IV is available through PhysioNet [25]. We used
the ICD-9 and ICD-10 diagnostic codes from the
icd_diagnosis table from the MIMIC-IV hosp module. We
filtered out patients who did not have at least one diagnostic
code associated with their records. Sequences of codes
were generated per patient by sorting their hospital visits
by time. Within one visit, the order of codes from the
MIMIC-IV database was used, since it represents the
relative importance of the code for that visit. Each unique
code was assigned a token. In total, there were 189,980
patient records in the dataset. We used 174,890 patient
records for pre-training, on which we performed a 90–10
training-validation split. We reserved 15k records for
fine-tuning tasks.</p>
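<p>The per-patient sequence construction described above can be sketched as follows. The record structure and the codes are hypothetical stand-ins; only the logic (sort visits by time, keep within-visit order, assign one token per unique code) follows the text.</p>
<preformat>
```python
# toy records: patient id mapped to visits of (time, [codes]); hypothetical data
records = {
    "p1": [(2, ["I10"]), (1, ["E11.9", "I10"])],
}

def code_sequence(visits):
    # sort visits by time; keep the within-visit order, since it
    # encodes the relative importance of each code for that visit
    ordered = sorted(visits, key=lambda v: v[0])
    return [code for _, codes in ordered for code in codes]

vocab = {}

def tokenize(seq):
    # assign each unique code a token id on first sight
    return [vocab.setdefault(c, len(vocab)) for c in seq]

tokens = tokenize(code_sequence(records["p1"]))
```
</preformat>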
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Model Architecture</title>
        <sec id="sec-3-2-1">
          <p>We utilized a BERT-base model architecture with a post-layer-norm position and a sequence length of 128 ICD codes [<xref ref-type="bibr" rid="ref11">26</xref>]. A custom embedding class was used to support the functionality required for our HRR embeddings. We adapted the BERT segment embeddings to represent groups of codes from the same hospital visit, using up to 100 segment embeddings to encode visit sequencing. An embedding dimension of d = 768 was used, and all embeddings were initialized from N(0, 0.02), as in [<xref ref-type="bibr" rid="ref11">26</xref>], including the atomic vectors for HRR embeddings. Fine-tuning used a constant learning rate schedule with a weight decay of 4e-6. Fine-tuning lasted 10 epochs with a batch size of 80.</p>
        </sec>
        <sec id="sec-3-2-1b">
          <title>2.3. Encoding SNOMED Ontology with HRR Embeddings</title>
          <p>In this section, we detail the methodologies of constructing vector embeddings for ICD disease codes using HRR operations based on the SNOMED CT structured clinical vocabulary. We first describe our mapping from ICD concepts to SNOMED CT terms. Next, we define how the atomic symbols present in the SNOMED CT ontology are combined using HRR operations to construct concept vectors for the ICD codes. Finally, we describe our method to efficiently compute the HRR embedding matrix using default PyTorch operations that are compatible with autograd.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>2.3.1. Mapping ICD to SNOMED CT Ontology</title>
          <p>
            Our data uses ICD-9 and ICD-10 disease codes, while our symbolic ontology is defined in SNOMED CT, so we required a mapping from the ICD to the SNOMED CT
system to build our symbolic architecture. We used the
SNOMED CT International Release from May 31, 2022
[23] and only included SNOMED CT terms that were
active at the time of that release. While SNOMED
publishes a mapping tool from SNOMED CT to ICD-10, a
majority of ICD-10 concepts have one-to-many mappings
in the ICD-to-SNOMED CT direction [
            <xref ref-type="bibr" rid="ref12">27</xref>
            ]. To increase
the fraction of one-to-one mappings, we used additional
published mappings from the Observational Medical
Outcomes Partnership (OMOP) [
            <xref ref-type="bibr" rid="ref13">28</xref>
            ], mappings from ICD-9
directly to SNOMED CT [
            <xref ref-type="bibr" rid="ref14 ref8">29</xref>
            ], and mappings from ICD-10
to ICD-9 [
            <xref ref-type="bibr" rid="ref15 ref9">30</xref>
            ].
          </p>
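<p>The cascaded mapping strategy (direct ICD-10 to SNOMED CT, then ICD-9 to SNOMED CT, then ICD-10 through ICD-9 to SNOMED CT) can be sketched as below. The mapping tables and codes here are hypothetical stand-ins; the real tables come from the SNOMED CT, OMOP, and ICD-10-to-ICD-9 releases cited above.</p>
<preformat>
```python
# hypothetical stand-in mapping tables (real ones come from the cited releases)
icd10_to_snomed = {"E11": "44054006"}
icd9_to_snomed = {"250.00": "44054006"}
icd10_to_icd9 = {"E11.9": "250.00"}

def map_icd_to_snomed(code):
    # try the direct maps first, then fall back through ICD-9
    if code in icd10_to_snomed:
        return icd10_to_snomed[code]
    if code in icd9_to_snomed:
        return icd9_to_snomed[code]
    icd9 = icd10_to_icd9.get(code)
    if icd9 is not None:
        return icd9_to_snomed.get(icd9)
    return None  # no active mapping: the ICD code is dropped
```
</preformat>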
          <p>Notably, after excluding ICD codes with no active
SNOMED CT mapping, 671 out of the 26,164 unique
ICD codes in the MIMIC-IV dataset were missing
mappings. When those individual codes were removed, a
data volume of 4.62% of codes was lost. This removed 58
out of 190,180 patients from the dataset, as they had no
valid ICD codes in their history. Overall, the remaining
25,493 ICD codes mapped to a total of 12,263 SNOMED
CT terms.</p>
        </sec>
        <sec id="sec-3-2-2b">
          <title>2.3.2. SNOMED CT Vector Symbolic Architecture</title>
          <p>Next, we define how the contents of the SNOMED CT ontology were used to construct a symbolic graph to represent ICD concepts. For a given SNOMED CT term, we used its descriptive words and its relationships to other SNOMED CT terms. A relationship is defined by a relationship type and a target term. In total, there were 13,852 SNOMED CT target terms and 40 SNOMED CT relationship types used to represent the ICD concepts. In the ontology, many ICD concepts share SNOMED CT terms in their representations.</p>
        </sec>
        <sec id="sec-3-2-4">
          <p>The set of relationships was not necessarily unique for each SNOMED CT term. To add more unique
information, we used a term’s “fully specified name” and
any “synonyms” as an additional set of words describing
that term. We set all text to lowercase, stripped
punctuation, and split on spaces to create a vocabulary of words.</p>
        </sec>
        <sec id="sec-3-2-5">
          <p>We removed common English stopwords using a custom stopword list that was compiled with assistance from a medical physician. The procedure resulted in a total of 8,833 vocabulary words.</p>
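<p>The text-normalization steps (lowercasing, punctuation stripping, splitting on spaces, stopword removal) can be sketched as follows. The stopword set here is a tiny stand-in for the custom clinician-assisted list, and the example term description is illustrative.</p>
<preformat>
```python
import re

STOPWORDS = {"of", "the", "and", "with"}  # stand-in for the custom list

def description_vocab(texts):
    vocab = set()
    for text in texts:
        # lowercase, strip punctuation, split on whitespace
        cleaned = re.sub(r"[^\w\s]", " ", text.lower())
        vocab.update(w for w in cleaned.split() if w not in STOPWORDS)
    return sorted(vocab)

words = description_vocab(["Diabetes mellitus, type 2 (disorder)"])
```
</preformat>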
        </sec>
        <sec id="sec-3-2-6">
          <p>Overall, there were a total of 22,725 “atomic” symbols for the VSA, which included the SNOMED CT terms,
relationships, and the description vocabulary. Each symbol
was assigned an “atomic vector”. We built a “concept
vector” for each of the target 25,493 ICD codes using HRR
operations to combine atomic vectors according to the SNOMED CT ontology structure.</p>
        </sec>
        <sec id="sec-3-2-8">
          <p>To build a d-dimensional concept vector for a given ICD concept, we first considered the set of all relationships that the concept maps to. We used the HRR operator for binding, circular convolution (⊛), to combine vectors representing the relationship type and destination term, and defined the concept vector to be the bundling of these bound relationships. For the description words, we bundled the vectors representing each word together and bound this result with a new vector representing the relationship type “description,” as shown in Equation 1:</p>
          <p>v<sub>ICD concept</sub> = ∑<sub>SNOMED CT</sub> v<sub>rel</sub> ⊛ v<sub>term</sub> + v<sub>desc</sub> ⊛ ∑<sub>words</sub> v<sub>word</sub>  (1)</p>
          <p>Formally, let 𝒜 = {1, 2, …, N<sub>𝒜</sub>} be the set of integers enumerating the unique atomic symbols for SNOMED CT terms and description words. Let ℛ = {1, 2, …, N<sub>ℛ</sub>} be the set of integers enumerating the unique relationships for SNOMED CT terms, including the description relationship and the binding identity. Let 𝒞 = {1, 2, …, N<sub>𝒞</sub>} be the set of integers enumerating the ICD-9 and ICD-10 disease concepts represented by the VSA.</p>
          <p>Each atomic symbol has an associated embedding matrix A ∈ ℝ<sup>N<sub>𝒜</sub>×d</sup>, where the atomic vector a<sub>i</sub> = A<sub>[i,:]</sub>, i ∈ 𝒜, is the i-th row of the embedding matrix. Similarly, there is a relationship embedding matrix R ∈ ℝ<sup>N<sub>ℛ</sub>×d</sup> with r<sub>j</sub> = R<sub>[j,:]</sub>, j ∈ ℛ, and an ICD concept embedding matrix C ∈ ℝ<sup>N<sub>𝒞</sub>×d</sup> with c<sub>k</sub> = C<sub>[k,:]</sub>, k ∈ 𝒞. We describe the VSA with the formula in Equation 2, where G<sub>k</sub> is a graph representing the connections of ICD concept k to atomic symbols i by relationships j:</p>
          <p>c<sub>k</sub> = ∑<sub>(j,i) ∈ G<sub>k</sub></sub> r<sub>j</sub> ⊛ a<sub>i</sub>  (2)</p>
        </sec>
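<p>Equation 2 can be implemented with batched FFT operations so that gradients flow back into the atomic and relationship matrices, in line with the autograd-compatible construction described in this section. The sketch below is illustrative only: the toy sizes and the edge list are invented, not taken from the SNOMED CT graph.</p>
<preformat>
```python
import torch

d = 64                                  # embedding dimension (toy size)
n_atoms, n_rels, n_concepts = 10, 4, 3
A = torch.nn.Parameter(0.02 * torch.randn(n_atoms, d))  # atomic vectors
R = torch.nn.Parameter(0.02 * torch.randn(n_rels, d))   # relationship vectors

# edges (k, j, i): concept k is connected to atom i by relationship j
edges = torch.tensor([[0, 0, 1], [0, 1, 2], [1, 0, 1], [2, 2, 3]])

def build_concepts(A, R, edges, n_concepts, d):
    # bind each (relationship, atom) pair by circular convolution via FFT
    fr = torch.fft.rfft(R[edges[:, 1]], dim=-1)
    fa = torch.fft.rfft(A[edges[:, 2]], dim=-1)
    bound = torch.fft.irfft(fr * fa, n=d, dim=-1)
    # bundle by summation into each concept row (Equation 2)
    return torch.zeros(n_concepts, d).index_add(0, edges[:, 0], bound)

C = build_concepts(A, R, edges, n_concepts, d)
C.sum().backward()  # gradients reach the atomic vectors through the HRR ops
```
</preformat>
<p>Because binding and bundling are expressed entirely in differentiable tensor operations, the same concept matrix can be rebuilt each step while the atomic vectors are trained.</p>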
        <sec id="sec-3-2-10">
          <p>Additional details on how to efficiently use PyTorch autograd to learn through these HRR operations are provided in the appendix.</p>
        </sec>
        <sec id="sec-3-2-11">
          <title>2.3.3. Embedding Configurations</title>
          <p>We call our method of constructing embeddings for ICD codes purely from HRR representations “HRRBase”, and the standard method of creating transformer token embeddings from random vectors “unstructured”. While the HRRBase configuration enforces the ontology structure, we wondered whether it would be too rigid and have difficulty representing information not present in SNOMED CT. As dataset frequency information for ICD medical codes is not present in the HRR structure, we tried adding an embedding that represents the empirical frequency of each ICD code in the dataset. We also tried adding fully learnable embeddings with no prior structure.</p>
        </sec>
        <sec id="sec-3-2-14">
          <p>
            Given the wide range of ICD code frequencies in MIMIC, we log-transformed the empirical ICD code
frequencies, and then discretized the resulting range. For
our HRRFreq configuration, we used the sinusoidal
frequency encoding as in [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] to encode the discretized
log-frequency information. The frequency embeddings were
normalized before being summed with the HRR
embedding vectors.
          </p>
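<p>A sketch of the HRRFreq frequency encoding: log-transform the counts, discretize them into integer bins, look up a sinusoidal encoding per bin, and normalize before summing with the HRR concept vector. The counts and the bin width (one bin per unit of natural log) are invented for illustration.</p>
<preformat>
```python
import torch

def sinusoidal(pos, d):
    # standard transformer sinusoidal encoding for an integer bin index
    i = torch.arange(d // 2, dtype=torch.float32)
    angles = pos / torch.pow(torch.tensor(10000.0), 2 * i / d)
    enc = torch.zeros(d)
    enc[0::2] = torch.sin(angles)
    enc[1::2] = torch.cos(angles)
    return enc

d = 768
counts = torch.tensor([50000.0, 120.0, 3.0])  # hypothetical code frequencies
bins = torch.log(counts).round().long()       # discretized log-frequency
freq = torch.stack([sinusoidal(int(b), d) for b in bins])
freq = torch.nn.functional.normalize(freq, dim=-1)  # unit norm before summing
# hrr_freq_embeddings = concept_vectors + freq
```
</preformat>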
        </sec>
        <sec id="sec-3-2-15">
          <p>We defined two additional configurations in which a standard embedding vector was integrated with the structured HRR concept vector. With “HRRAdd”, a learnable embedding was added to the concept embedding: C<sub>HRRAdd</sub> = C + E<sub>add</sub>, where E<sub>add</sub> ∈ ℝ<sup>N<sub>𝒞</sub>×d</sup>. However, this roughly doubled the number of learnable parameters compared to other formulations.</p>
          <p>With “HRRCat”, a learnable embedding of dimension d/2 was concatenated with the HRR concept embedding of dimension d/2. This keeps the total number of learnable parameters roughly the same as in the unstructured configuration (25,493 d-dimensional vectors) and the HRRBase configuration (22,725 d-dimensional vectors). The final embedding matrix was defined as C<sub>HRRCat</sub> = [C E<sub>cat</sub>], where C, E<sub>cat</sub> ∈ ℝ<sup>N<sub>𝒞</sub>×d/2</sup> and the brackets denote concatenation along the embedding dimension.</p>
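<p>The two hybrid configurations reduce to one line each. The sizes below are illustrative toy values (the paper uses 25,493 codes at d = 768), and C stands in for the HRR-derived concept matrix.</p>
<preformat>
```python
import torch

n, d = 100, 768                  # toy sizes; the paper uses 25,493 x 768
C = torch.randn(n, d)            # stand-in for HRR concept embeddings

# HRRAdd: learnable offset added to each concept vector (~2x parameters)
E_add = torch.nn.Parameter(0.02 * torch.randn(n, d))
hrr_add = C + E_add

# HRRCat: d/2 structured half concatenated with a d/2 learnable half
C_half = torch.randn(n, d // 2)  # HRR part built at dimension d/2
E_cat = torch.nn.Parameter(0.02 * torch.randn(n, d // 2))
hrr_cat = torch.cat([C_half, E_cat], dim=-1)
```
</preformat>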
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>2.4. Experiments</title>
        <sec id="sec-3-3-1">
          <p>We pre-trained the unstructured, HRRBase, HRRCat, and HRRAdd embedding configurations of HRRBERT on the masked language modelling (MLM) task, for 3 trials each. For each of the 3 pre-trained models, 10 fine-tuning trials were conducted, for a total of 30 trials per fine-tuning task.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>The best checkpoint from the 10 epochs of fine-tuning</title>
          <p>was saved based on validation performance. A test set
containing 666 patient records was used to evaluate each
of the fine-tuned models for both mortality and disease
prediction. We report accuracy, precision, recall, and
F1 scores averaged over the 30 trials for the fine-tuning
tasks.
3. Experimental Results
3.1. Pre-training
date. A training set of 13k patient records along with a
validation set of 2k patient records were used to fine-tune
each model on mortality prediction. Table 1 shows the
evaluation results of mortality prediction for each of the
configurations. We performed a two-sided Dunnett’s test
to compare our multiple experimental HRR embedding
configurations to the control unstructured embeddings,
with  &lt; 0.05 significance level. HRRBase embeddings
had a significantly greater mean F1-score ( = 0.043)
and precision ( = 0.042) compared to unstructured
embeddings.
3.2.2. Disease Prediction Task</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>The disease prediction task is defined as predicting which</title>
          <p>Figure 1: Pre-training validation set evaluation results for disease chapters were recorded in the patient’s last visit
diferent configurations using information from earlier visits. We converted all
ICD codes in a patient’s last visit into a multi-label
bi</p>
          <p>MLM accuracy is evaluated on a validation set over the nary vector of disease chapters. As there are 22 disease
course of pre-training. Pre-training results for diferent chapters defined in ICD-10, the multi-label binary vector
configurations are shown in Figure 1. The pre-training has a size of 22 with binary values corresponding to the
results are averaged over 3 runs for each of the configu- presence of a disease in each chapter. A training set of
rations except for HRRFreq where only 1 model run was 4.5k patient records along with a validation set of 500
completed. patient records were used to fine-tune each model on</p>
          <p>The baseline of learned unstructured embeddings has this task. Table 1 shows the evaluation results of disease
a peak pre-training validation performance of around prediction for each of the configurations. For the
two33.4%. HRRBase embeddings perform around 17% worse sided Dunnett test, Levene’s test shows that the equal
compared to the baseline of learned unstructured embed- variance condition is satisfied, and the Shapiro-Wilk test
dings. We hypothesize that this decrease in performance suggests normal distributions except for HRRAdd
accuis due to a lack of embedded frequency information in racy. The test showed HRRBase embeddings had a
signifHRRBase compared to learned unstructured embeddings. icantly greater mean accuracy ( = 0.033) and precision
HRRFreq (which combines SNOMED CT information ( = 0.023) compared to unstructured embeddings. No
with frequency information) has a similar performance other comparisons of mean metrics for HRR embeddings
compared to unstructured embeddings, supporting this were significantly greater than the control.
hypothesis. Compared to baseline, HRRAdd and HRRCat
improve pre-training performance by a modest margin of 3.2.3. eICU Mortality Prediction
around 2%. We posit that this almost 20% increase in
performance of HRRCat and HRRAdd over HRRBase during
pre-training is partly due to the fully learnable
embedding used in HRRCat and HRRAdd learning frequency
information.</p>
          <p>
            An additional experiment conducted on the Philips
Electronic Intensive Care Unit (eICU) [
            <xref ref-type="bibr" rid="ref10 ref16">31</xref>
            ] shows
corroborating results with the MIMIC-IV experiments. For our
experiment, we applied our mortality prediction models
that were fine-tuned on MIMIC-IV to eICU data to see
if our results generalize. Table 1 shows that HRRBase
embeddings had a significantly greater mean accuracy
( = 0.046 ) compared to unstructured embeddings when
applied to the eICU dataset. These models are not
optimized for mortality prediction for other hospitals where
coding methodology and clinical practice may difer. For
example, the most common code in the eICU dataset
represents acute respiratory failure, whereas the most
common code in the MIMIC-IV dataset represents
hypertension.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.2. Fine-tuning</title>
        <sec id="sec-3-4-1">
          <title>We fine tuned the networks for mortality prediction and disease prediction. Across metrics and tasks, the best results were often seen in HRRBase (Table 1) with some being statistically significant.</title>
          <p>3.2.1. Mortality Prediction Task</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>The mortality prediction task is defined as predicting patient mortality within 6 months after the last visit. Binary mortality labels were generated by comparing the time diference between the last visit and the mortality</title>
          <p>We conducted an additional disease-prediction
experiment to test generalization to patients with codes outside
the training distribution. We found six patients with
records that consisted of only 32 codes between them
(see list of codes in Appendix A). We created a
really-out-of-distribution (ROOD) dataset that consisted of all
patients in MIMIC-IV (nearly 30K) with at least one of
these codes. We used this as a validation set. The
separate pre-training and fine-tuning dataset did not contain
these codes. We also created a smaller validation dataset
consisting of the six patients with only these codes.
During pre-training, the HRRBase and unstructured models
did not encounter any examples using the 32 ROOD codes
and so did not explicitly learn representations for those
codes. The trained models were then tested using the
ROOD dataset.</p>
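<p>The ROOD split described above amounts to simple set operations over patient code sets; the records and the three codes below are toy stand-ins, not MIMIC-IV data:</p>

```python
# Toy stand-ins: each patient maps to the set of ICD codes in their record.
ROOD_CODES = {"G248-10", "9916-9", "R636-10"}  # stand-in for the 32 codes
records = {
    "p1": {"G248-10", "K219-10"},  # has a ROOD code -> ROOD validation set
    "p2": {"K219-10", "2724-9"},   # no ROOD codes  -> pre-training/fine-tuning
    "p3": {"9916-9", "R636-10"},   # only ROOD codes -> small validation set
}

# Patients with at least one ROOD code form the ROOD validation set.
rood_val = {pid for pid, codes in records.items() if codes & ROOD_CODES}

# The training pool is everyone else, so it never contains the ROOD codes.
train_pool = set(records) - rood_val

# Patients whose records consist only of ROOD codes form the smaller set.
rood_only = {pid for pid, codes in records.items() if codes <= ROOD_CODES}
```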
          <p>Results from Table 1 on ROOD dataset disease
prediction show that HRRBase outperforms the unstructured
embedding model for contexts of entirely unseen codes.
We assess statistical significance using a two-tailed,
independent t-test with unequal variance, as some
measurements failed Levene's test for equal variance. The
means of all the metrics for HRRBase are significantly
greater than for unstructured when making inferences
on patients with entirely unseen codes, with p &lt; 0.001 for all
metrics. Given the embedded ontological structure, we
hypothesize that HRRBase implicitly learns useful
embeddings for the 32 unseen ROOD codes by learning the
shared embedding components of the VSA during training.</p>
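<p>In scipy terms, this testing procedure looks roughly as follows; the per-seed score arrays are illustrative stand-ins, not the reported measurements:</p>

```python
import numpy as np
from scipy import stats

# Illustrative per-seed metric scores for two embedding types (stand-ins).
hrr_scores = np.array([0.81, 0.83, 0.82])
unstructured_scores = np.array([0.70, 0.74, 0.66])

# Levene's test checks the equal-variance assumption first.
_, p_levene = stats.levene(hrr_scores, unstructured_scores)

# Welch's t-test (equal_var=False) does not assume equal variances;
# scipy's ttest_ind is two-tailed by default.
t, p_value = stats.ttest_ind(hrr_scores, unstructured_scores, equal_var=False)
```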
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.3. t-SNE of Frequency Bias</title>
        <sec id="sec-3-5-1">
          <p>
            We computed t-SNE dimension reductions to visualize relationships among ICD code embeddings in the
pre-trained models. Figure 2 shows that unstructured
embeddings of common ICD codes are clustered together
with a large separation from those of uncommon codes.
This suggests that code-frequency information is
prominently represented in these embeddings, consistent with
frequency bias in related models [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Common and
uncommon code clusters are less distinct in HRRBase, which
does not explicitly encode frequency information.
          </p>
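<p>A minimal sketch of such a visualization, with random matrices standing in for the pre-trained code embeddings and their log frequencies:</p>

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))   # stand-in for ICD code embeddings
log_freq = rng.uniform(-14, 0, size=200)  # stand-in for log code frequencies

# Project to 2-D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(embeddings)

# coords can then be scattered and colored by log_freq, e.g. with
# matplotlib: plt.scatter(coords[:, 0], coords[:, 1], c=log_freq).
```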
<p>As shown in Figure 1, adding code-frequency
information to the structured HRRBase embeddings, i.e. the
HRRFreq embeddings, improved the pre-training loss to be
similar to that of unstructured embeddings. This suggests that
unstructured components in HRRAdd and HRRCat may
have learned some frequency information, since these
losses are also similar to the loss of models with
Unstructured embeddings. To investigate whether this occurred,
we performed t-SNE dimension reductions of the
unstructured components of HRRAdd and HRRCat and colored
the points by code frequency, shown in Figure 3. This
graph suggests that these additional unstructured
embeddings learn some frequency information, due to
clustering of high-frequency codes. However, the frequency
information learned by the HRRCat and HRRAdd learnable
embeddings influences the overall embeddings less strongly
than in unstructured embeddings, as seen in
Figure 2, where low-frequency embeddings are less distinctly
separated from higher-frequency embeddings.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>3.4. Top-k Accuracy for MLM</title>
        <p>Accurately predicting infrequently used disease codes
is an important clinically relevant task. Given that the
model trains and sees more common codes compared
to rare codes, rare codes are naturally challenging to
predict. Through promising empirical results on
out-of-distribution mortality prediction for eICU and disease
prediction on ROOD, we hypothesized that our HRR
embedding models should have improved accuracy when
predicting rare codes in the dataset compared to
unstructured embedding models, since rare codes should share
some atomic vectors in their representations with
common codes.</p>
        <p>To test this, we evaluated the accuracy of an MLM pre-trained model predicting a single masked code of a known frequency. We split the codes in the pre-training validation dataset into 7 bins from log frequency -14 to 0, such that each bin has a width of 2. The most common codes are in the bin with log frequencies between -2 and 0, while the rarest codes are in the bin with log frequencies between -14 and -12. From each bin, we selected 400 codes at random, repeating codes from that bin if there were fewer than 400. For each of these codes, we selected one patient that had that code in their history, masked that code as would be done in MLM, and created a dataset of these 2,800 patients to use for MLM inference.</p>
        <p>Figure 4 and Figure 5, respectively, show the MLM top-10 and top-100 accuracy on predicting codes in the different frequency bins, averaged across the three pre-training models per configuration. [Figure 5 caption: The top-100 MLM accuracy for binned code frequencies in log scale. Common codes are in frequency bin 0, with the rarest codes in frequency bin -12. Significance levels of 0.05, 0.01, and 0.001 relative to unstructured embeddings are indicated with one, two, and three asterisks respectively.] Significant comparisons to the unstructured control at the p &lt; 0.05 level are indicated with an asterisk. We assess statistical significance for each bin using a two-tailed Dunnett's test comparing mean accuracy scores of the experimental HRR configurations against the control unstructured configuration. Notably, the top-100 accuracy in frequency bin -12 is non-zero for the HRR methods. The codes in this rarest bin occur only once in the dataset and therefore have never been used by the model for gradient updates, since they are in the validation dataset. This suggests that the HRR methods have some ability to provide clinically relevant information about rare codes. However, accuracy with the rarest codes remains too low to be of practical value, perhaps due to limited overlap of these codes' atomic vectors with those of more common codes.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.5. Medical Code Case Study</title>
        <p>Table 2 shows case studies for the codes Other and unspecified hyperlipidemia (2724-9), Hypothermia (9916-9), and Gastro-esophageal reflux disease without esophagitis (K219-10). In the first case study, for 2724-9, we observe that highly ontologically similar codes, such as Other hyperlipidemia and Hyperlipidemia, unspecified, are encoded with high cosine similarity for HRRBase, which is not the case for unstructured embeddings. The co-occurrence problem can be seen in the second case study, for 9916-9. The most similar codes for HRRBase are medically similar codes that would not usually co-occur, while for unstructured embeddings the most similar codes co-occur frequently. For the final case study, on K219-10, frequency-related bias can be observed in the unstructured embeddings, with frequent but mostly ontologically unrelated codes in the top list of cosine-similar codes, whereas the top list of cosine-similar codes for HRRBase contains medically similar codes.</p>
        <p>[Table 2 (excerpt): top cosine-similar codes. For Hypothermia (9916-9), the top Unstructured matches were Frostbite of hand (0.418), Frostbite of foot (0.361), Drowning and nonfatal submersion (0.352), and Immersion foot (0.341); the top HRRBase matches were Hypothermia, initial encounter; Hypothermia not with low env. temp.; Effect of reduced temp., initial encounter; and Other specified effects of reduced temp. For Gastro-esophageal reflux disease without esophagitis (K219-10), the top Unstructured matches were Esophageal reflux (0.565), Hyperlipidemia, unspecified (0.335), Anxiety disorder, unspecified (0.332), and Essential (primary) hypertension (0.326); the top HRRBase matches were Esophageal reflux; Gastro-eso. reflux d. with esophagitis; Reflux esophagitis; and Hypothyroidism, unspecified.]</p>
        <p>We broadened this case study to test statistical differences in cosine and semantic embedding similarity between structured and unstructured embeddings. 30 ICD codes were selected from different frequency categories in the dataset, with 10 codes drawn randomly from the 300 most common codes, 10 codes drawn randomly by weighted frequency from codes appearing fewer than 30 times in the dataset, and 10 codes randomly selected by weighted frequency from the entire dataset. For each selected code, the top 4 cosine-similar ICD codes were assessed by a physician for ontological similarity.</p>
        <p>For each frequency category, a one-tailed Fisher's exact test was conducted to determine whether a relationship existed between embedding type and clinical relatedness. We found that the results in the case of the rare codes were statistically significant, with p = 2.44×10<sup>-8</sup>. With 10 rare codes and the top 4 cosine-similar ICD codes selected for each rare code, there are 40 top cosine-similar codes in total. In the case of unstructured embeddings, only 4 of the top 40 cosine-similar codes were deemed to be strongly ontologically related by our physician, with the remaining codes deemed to be less related or unrelated. In the case of our structured HRRBase embeddings, 28 of the top 40 cosine-similar codes were deemed to be strongly ontologically related by our physician, with the remaining codes deemed to be less related or unrelated. This suggests that knowledge-integrated structured embeddings are associated with greater clinical relevance of the top cosine-similar codes than unstructured embeddings for rare codes where little training data exists.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Transformers have leading performance in many
applications, but their internal processes are opaque, emerging
from enormous parameter sets and data volumes beyond
human experience. It is hard to know when they can
be trusted. For example, generative transformers are
prone to subtle confabulations. Transformers have a
general-purpose architecture that performs as well in
vision and other modalities as in language. They are a
culmination of a key trend in artificial intelligence, away
from problem-specific engineering, and toward massive
data and computation. This trend is justified in terms of
performance. However, given two models with equal
performance, one with more explicit conceptual structure is
preferable in terms of trust and explainability.</p>
      <p>The work presented here is a step in this direction, with
our HRRBase embeddings that have explicit conceptual
structure and perform equivalently or better compared
to typical transformer embeddings. The benefit of
structured embeddings becomes more pronounced for tasks
that involve codes that are rare or are not present in
training data. HRR embeddings can also be relied on to
represent medical meaning rather than co-occurrence in
the training data. They also untangle the representation
of code frequency, so that it can be included or not, and
its effects on decisions understood. Importantly, despite
this additional structure, the embeddings are thoroughly
learned, suggesting that the approach will be consistent
with high performance beyond the examples we have
studied.</p>
      <p>As our method scales with and leverages PyTorch
autograd in the construction of the vector-symbolic
embeddings, it is compatible with existing medical LLM
architectures as an embedding component capable of
encoding domain knowledge.</p>
      <p>Future work could explore the potential of these
structured embeddings for explaining and controlling the
observed frequency bias. As HRRs can be queried with
linear operations, future work could also explore whether
transformers can learn to extract specific information
from these composite embeddings. Limitations to
address in future work include the complexity of processing
knowledge graphs to be compatible with HRRs. Another
important limitation is that our method relies on
rare-code HRRs sharing atomic elements with common-code
HRRs. However, in SNOMED CT, rare codes are likely to
contain some rare atomic elements. To address this point,
in addition to SNOMED CT, knowledge could be encoded
from sources such as pre-trained medical embeddings,
different medical ontologies, and other medical domain
knowledge to further improve our proposed
methodology. In LLMs that process both medical codes and text,
it would make sense to share word embeddings between
modalities. This would allow training of each modality
to benefit from training of the other, and may help to
align the representations of codes and text.</p>
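<p>The linear-query property of HRRs mentioned above can be illustrated with circular convolution binding and its approximate inverse, in the style of Plate's HRRs [9]; this is a sketch with random vectors, not the paper's implementation:</p>

```python
import numpy as np

def bind(a, b):
    # Circular convolution, computed in the Fourier domain.
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def unbind(c, a):
    # Approximate inverse of binding: convolve with the involution of a,
    # which is equivalent to circular correlation.
    inv_a = np.concatenate(([a[0]], a[:0:-1]))
    return bind(c, inv_a)

d = 1024
rng = np.random.default_rng(0)
role = rng.normal(0, 1 / np.sqrt(d), d)    # e.g. a relationship vector
filler = rng.normal(0, 1 / np.sqrt(d), d)  # e.g. an atomic symbol vector

trace = bind(role, filler)       # composite HRR embedding
recovered = unbind(trace, role)  # noisy copy of the filler

# A cleanup step would match `recovered` against the symbol vocabulary by
# cosine similarity; here we just check that it resembles the filler.
cos = recovered @ filler / (np.linalg.norm(recovered) * np.linalg.norm(filler))
```

Because both binding and unbinding are linear, these queries could in principle be applied to the composite embeddings a transformer consumes.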
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <p>We proposed a novel hybrid neural-symbolic approach called HRR-BERT that integrates medical ontologies
represented by HRR embeddings. In tests with the
MIMIC-IV dataset, HRR-BERT models modestly outperformed
baseline models with unstructured embeddings for
pre-training, disease prediction accuracy, mortality
prediction F1, and fine-tuning tasks involving infrequently seen
codes. HRR-BERT models had pronounced performance
advantages in MLM with rare codes and disease
prediction for patients with no codes seen during training
(ROOD - Unseen in Table 1). We also showed that HRRs
can be used to create medical code embeddings that
better respect ontological similarities for rare codes. A key
benefit of our approach is that it facilitates explainability
by disentangling token-frequency information, which
is prominently represented but implicit in unstructured
embeddings.
</p>
        <p>[7, cont.] training and scaling, in: International Conference on Machine Learning, PMLR, 2023, pp. 2397–2430.</p>
        <p>[8] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.</p>
        <p>[9] T. Plate, Holographic reduced representations, IEEE Transactions on Neural Networks 6 (1995) 623–641. doi:10.1109/72.377968.</p>
        <p>[10] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, R. Mark, MIMIC-IV (version 2.0), 2022. URL: https://doi.org/10.13026/7vcr-e114. doi:10.13026/7vcr-e114.</p>
        <p>[11] P. Kanerva, Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors, Cognitive Computation 1 (2009) 139–159. URL: https://api.semanticscholar.org/CorpusID:733980.</p>
        <p>[12] R. W. Gayler, Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience, 2004. arXiv:cs/0412059.</p>
        <p>[13] P. Neubert, S. Schubert, Hyperdimensional computing as a framework for systematic aggregation of image descriptors, 2021. URL: http://arxiv.org/abs/2101.07720. doi:10.48550/arXiv.2101.07720.</p>
        <p>[14] P. Neubert, S. Schubert, K. Schlegel, P. Protzel, Vector semantic representations as descriptors for visual place recognition, in: Robotics: Science and Systems XVII, Robotics: Science and Systems Foundation, 2021. URL: http://www.roboticsproceedings.org/rss17/p083.pdf. doi:10.15607/RSS.2021.XVII.083.</p>
        <p>[15] A. Rahimi, P. Kanerva, L. Benini, J. M. Rabaey, Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals, Proceedings of the IEEE 107 (2019) 123–143. doi:10.1109/JPROC.2018.2871163.</p>
        <p>[16] K. Schlegel, P. Neubert, P. Protzel, HDC-MiniROCKET: Explicit time encoding in time series classification with hyperdimensional computing, 2022. URL: http://arxiv.org/abs/2202.08055. doi:10.48550/arXiv.2202.08055.</p>
        <p>[17] P. Smolensky, Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159–216. URL: https://www.sciencedirect.com/science/article/pii/000437029090007M. doi:10.1016/0004-3702(90)90007-M.</p>
        <p>[18] M. M. Alam, E. Raff, S. Biderman, T. Oates, J. Holt, Recasting self-attention with holographic reduced representations, 2023. arXiv:2305.19534.</p>
        <p>[19] J. Kim, H. Lee, M. Imani, Y. Kim, Efficient hyperdimensional learning with trainable, quantizable, and holistic data representation, in: 2023 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2023, pp. 1–6. doi:10.23919/DATE56975.2023.10137134.</p>
        <p>[20] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge graphs, 2015. URL: http://arxiv.org/abs/1510.04935. doi:10.48550/arXiv.1510.04935.</p>
        <p>[21] T. Dash, A. Srinivasan, L. Vig, Incorporating symbolic domain knowledge into graph neural networks, Machine Learning 110 (2021) 1609–1636. URL: https://doi.org/10.1007%2Fs10994-021-05966-z. doi:10.1007/s10994-021-05966-z.</p>
        <p>[22] M. Kulmanov, F. Z. Smaili, X. Gao, R. Hoehndorf, Semantic similarity and machine learning with ontologies, Briefings in Bioinformatics 22 (2020) bbaa199. URL: https://doi.org/10.1093/bib/bbaa199. doi:10.1093/bib/bbaa199.</p>
        <p>[23] V. Riikka, V. Anne, P. Sari, Systematized nomenclature of medicine-clinical terminology (SNOMED CT) clinical use cases in the context of electronic health record systems: Systematic literature review, JMIR Med Inform (2023). doi:10.2196/43750.</p>
        <p>[24] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. M. Fung, J. Poon, Medical concept embedding with multiple ontological representations, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 4613–4619. URL: https://doi.org/10.24963/ijcai.2019/641. doi:10.24963/ijcai.2019/641.</p>
        <p>[25] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, H. E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals, Circulation 101 (2000) e215–e220. doi:10.1161/01.CIR.101.23.e215.</p>
        <p>[<xref ref-type="bibr" rid="ref11">26</xref>] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
        <p>[<xref ref-type="bibr" rid="ref12">27</xref>] NLM, SNOMED CT to ICD-10-CM map, 2022. URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.</p>
        <p>[<xref ref-type="bibr" rid="ref13">28</xref>] OHDSI, OHDSI standardized vocabularies, 2019. URL: https://github.com/OHDSI/Vocabulary-v5.0/wiki.</p>
        <sec id="sec-5-1-1">
          <title>A.1. Learning through HRR Operations Efficiently</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. List of 32 ROOD Codes</title>
      <sec id="sec-6-1">
        <title>The following is the list of 32 ROOD codes:</title>
        <p>1. G248-10: Other dystonia
2. E8498-9: Accidents occurring in other specified places
3. E9688-9: Assault by other specified means
4. Z681-10: Body mass index (BMI) 19.9 or less, adult
5. 30550-9: Opioid abuse, unspecified
6. R262-10: Difficulty in walking, not elsewhere classified
7. E887-9: Fracture, cause unspecified
8. R471-10: Dysarthria and anarthria
9. 9916-9: Hypothermia
10. E9010-9: Accident due to excessive cold due to weather conditions
11. F10129-10: Alcohol abuse with intoxication, unspecified
12. E8499-9: Accidents occurring in unspecified place
13. R636-10: Underweight
14. 920-9: Contusion of face, scalp, and neck except eye(s)
15. R4182-10: Altered mental status, unspecified
16. 95901-9: Head injury, unspecified
17. 78097-9: Altered mental status
18. F29-10: Unspecified psychosis not due to a substance or known physiological condition
19. Z880-10: Allergy status to penicillin
20. Z818-10: Family history of other mental and behavioral disorders
21. 81600-9: Closed fracture of phalanx or phalanges of hand, unspecified
22. 87341-9: Open wound of cheek, without mention of complication
23. H9222-10: Otorrhagia, left ear
24. Z978-10: Presence of other specified devices
25. G20-10: Parkinson's disease</p>
        <p>To make the HRR concept embeddings useful for a deep neural network, the operations used to form the embeddings need to be compatible with backpropagation, so that gradient descent can update the lower-level atomic vectors. We desired a function that produced the ICD concept embedding matrix given the inputs of the VSA knowledge graphs and the symbol embedding matrices.</p>
        <p>We attempted three approaches to computing the concept embedding matrix through VSA operations. First, we naively tried to compute each concept vector one at a time. However, this approach was too slow in both the forward and backward pass, requiring more than 1 second for each pass.</p>
        <p>Our second approach used slices of the knowledge graph along the relationship dimension as a sparse binary matrix which, when multiplied with the symbol embedding matrix, would perform the indexing and summing of atomic vectors for each concept. This result can be convolved with the relationship vector and added to the concept embedding matrix. This approach was much faster and used a moderate amount of memory for one of our less complex VSA formulations. However, when dealing with our most complex formulation, it used ∼15 GB of memory.</p>
        <p>Our final approach took advantage of the fact that many disease concepts use the same relationship, but with different atomic symbols. Also, the number of times a concept uses a particular relationship is relatively low, except for the SNOMED "isA" relationship and our defined "description" relationship. Thus, for a particular relationship, we can contribute to building many disease concept vectors at once by selecting many atomic vectors, doing a vectorized convolution with the relationship vector, and distributing the results to be added to the appropriate concept embedding rows. This step needs to be repeated, for a particular relationship, at most as many times as the maximum multiplicity of that relationship among all concepts. We improved memory efficiency by performing fast Fourier transforms (FFTs) on the atomic vector embeddings and constructing the concept vectors by performing binding via element-wise multiplication in the Fourier domain. Due to the linearity of the HRR operations, we performed a single final inverse FFT on the complex-valued concept embedding to convert back to the real domain.</p>
        <p>The final approach is much faster than the first
approach since it takes advantage of vectorized operations
to contribute to many concept vectors at once. It is also
more memory eficient than the second approach since
all the intermediate results are dense, so allocations are
not wasted on creating mostly sparse results. On our
most complex formulation, this approach uses ∼3.5 GB
of memory, and takes ∼80 ms and ∼550 ms for forward
and backward pass respectively.</p>
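<p>A minimal NumPy sketch of the final approach, assuming random stand-ins for the atomic vectors and a single relationship; note that NumPy's <italic>np.add.at</italic> performs the scatter-add into concept rows and, unlike the repeated passes described above, handles repeated concept indices in one call:</p>

```python
import numpy as np

d = 256                                   # HRR dimensionality
rng = np.random.default_rng(1)

n_atoms, n_concepts = 50, 20
atoms = rng.normal(0, 1 / np.sqrt(d), (n_atoms, d))  # atomic symbol vectors
relation = rng.normal(0, 1 / np.sqrt(d), d)          # one relationship vector

# One (concept, atom) pair per edge of this relationship in the graph.
n_edges = 100
concept_idx = rng.integers(0, n_concepts, size=n_edges)
atom_idx = rng.integers(0, n_atoms, size=n_edges)

# Bind all selected atoms with the relationship vector in one vectorized
# pass: element-wise multiplication in the Fourier domain is circular
# convolution, and rfft broadcasts over the stacked atom vectors.
bound = np.fft.irfft(np.fft.rfft(atoms[atom_idx]) * np.fft.rfft(relation), n=d)

# Scatter-add each bound vector into its concept's embedding row;
# np.add.at accumulates correctly even when a concept index repeats.
concepts = np.zeros((n_concepts, d))
np.add.at(concepts, concept_idx, bound)
```

By the linearity of HRR superposition, this produces the same concept matrix as binding one edge at a time, while staying dense throughout.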
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rasmy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Med-bert: pretrained contextualized embeddings on largescale structured electronic health records for disease prediction</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>86</fpage>
          . doi:10.1038/s41746-021-00455-y.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gandhi</surname>
          </string-name>
          , E. Raf,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McLean</surname>
          </string-name>
          ,
          <article-title>Learning with holographic reduced representations</article-title>
          ,
          <source>CoRR abs/2109</source>
          .02157 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2109.02157. arXiv:
          2109.02157.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Sarker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eberhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          , Neurosymbolic artificial intelligence,
          <source>AI</source>
          Communications
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>197</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramgopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Sanchez-Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Horvat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Florin</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence-based clinical decision support in pediatrics</article-title>
          ,
          <source>Pediatric research 93</source>
          (
          <year>2023</year>
          )
          <fpage>334</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuinstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rezai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fortin</surname>
          </string-name>
          , R. DiMaio,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vartian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tripp</surname>
          </string-name>
          ,
          <article-title>Frequency bias in mlmtrained bert embeddings for medical codes</article-title>
          ,
          <source>CMBES Proceedings 45</source>
          (
          <year>2023</year>
          ). URL: https://proceedings. cmbes.ca/index.php/proceedings/article/view/1050.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. G.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hallahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Prashanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raff</surname>
          </string-name>
          , et al.,
          <article-title>Pythia: A suite for analyzing large language models across</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [29]
          <string-name>
            <surname>NLM</surname>
          </string-name>
          ,
          <article-title>ICD-9-CM diagnostic codes to SNOMED CT map</article-title>
          ,
          <year>2022</year>
          . URL: https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [30]
          <string-name>
            <surname>NCHS</surname>
          </string-name>
          , Diagnosis code set general equivalence mappings,
          <year>2018</year>
          . URL: https://ftp.cdc.gov/pub/health_statistics/nchs/Publications/ICD10CM/2018/Dxgem_guide_2018.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E. W.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Rafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Badawi</surname>
          </string-name>
          ,
          <article-title>The eICU Collaborative Research Database, a freely available multi-center database for critical care research</article-title>
          ,
          <source>Scientific data 5</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          26. G249-10: Dystonia, unspecified
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          27. 9100-9: Abrasion or friction burn of face, neck, and scalp except eye, without mention of infection
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          28. 78906-9: Abdominal pain, epigastric
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          29. E8889-9: Unspecified fall
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          30. 30500-9: Alcohol abuse, unspecified
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          31. G520-10: Disorders of olfactory nerve
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          32. 8020-9: Closed fracture of nasal bones
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>