=Paper=
{{Paper
|id=Vol-2599/CKG2019_paper_3
|storemode=property
|title=Construction of UMLS Metathesaurus with Knowledge-Infused Deep Learning
|pdfUrl=https://ceur-ws.org/Vol-2599/CKG2019_paper_3.pdf
|volume=Vol-2599
|authors=Hong Yung Yip,Vinh Nguyen,Olivier Bodenreider
|dblpUrl=https://dblp.org/rec/conf/semweb/YipNB19
}}
==Construction of UMLS Metathesaurus with Knowledge-Infused Deep Learning==
Hong Yung Yip¹, Vinh Nguyen², Olivier Bodenreider²
¹ Artificial Intelligence Institute, University of South Carolina, Columbia, SC, USA
² National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
hyip@email.sc.edu, vinh.nguyen@nih.gov, obodenreider@mail.nih.gov
Abstract. The Unified Medical Language System (UMLS) is a Metathesaurus of biomedical vocabularies developed to integrate the variety of ways the same concepts are expressed by different terminologies and to provide a cross-walk among them. However, the current process of constructing and inserting new resources into the existing Metathesaurus relies heavily on lexical knowledge, semantic pre-processing, and manual audits by human editors. This project explores the use of a supervised Deep Learning approach to identify synonymy and non-synonymy among English UMLS concepts at the atom level. We use a Siamese network with Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models to learn the similarities and dissimilarities between pairs of atoms from the active subset of the 2019AA UMLS. To disambiguate concepts with lexically identical atoms, we contextualize the pairs with various enrichment strategies that reflect the information available to the UMLS editors, including source synonymy, hierarchical context, and source semantic group. Learning from the base lexical features of the atoms yields an overall F1-score of 75.97%. Infusing source synonymy into the base yields a higher precision of 86.54% and an overall F1-score of 87.63%. Infusing hierarchical context, by contrast, trades precision for a higher recall of 90.38%. Infusing source synonymy, hierarchical context, and semantic group together increases overall accuracy to 95.20%. However, infusing the source synonymy of the hierarchical context does not yield any noticeable improvement. A knowledge-infused learning approach provides good performance, indicating promising potential for emulating the current building process. Future work includes evaluation against the rule-based normalization approach of constructing the Metathesaurus and investigation of the applicability, maintenance, and scalability of these models.
Keywords: Unified Medical Language System · Semantic Similarity ·
Deep Learning · Contextualized Knowledge Graph
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Introduction
The Unified Medical Language System (UMLS) is a rich repository of biomedical
vocabularies developed by the US National Library of Medicine. It is an effort to overcome challenges to the effective retrieval of machine-readable information, one of which is the variety of ways the same concepts are expressed by different terminologies [1]. For example, the concept of "Addison's Disease" is expressed as
"Primary hypoadrenalism" in the Medical Dictionary for Regulatory Activities
(MedDRA) and as "Primary adrenocortical insufficiency" in the 10th revision of
the International Statistical Classification of Diseases and Related Health Prob-
lems (ICD-10). The lack of integration between these synonymous terms often
leads to poor interoperability between information systems (i.e. how does one
map a concept from one terminology to another) and confusion among health
professionals. Hence, the UMLS aims to integrate and provide a cross-walk among various terminologies as well as facilitate the creation of more effective and interoperable biomedical information systems and services, including electronic health records³. To date, it is increasingly used in areas such as patient care coordination, clinical coding, information retrieval, and data mining. There
are three components to the UMLS Knowledge Sources: the Metathesaurus, the
Semantic Network, and the SPECIALIST Lexicon and Lexical Tools.
The Metathesaurus is a vocabulary database organized by concept or mean-
ing. It is built from the electronic versions of various thesauri, code sets, clas-
sifications, and lists of controlled terms used in biomedical, clinical, and health
services, known as "terminologies" or interchangeably as "source vocabularies".
It connects alternative names (i.e. name variants) that are considered to be
synonymous under the same concept and identifies useful relationships between
various concepts [1]. Concepts are assigned at least one Semantic Type from the
Semantic Network to provide semantic categorization. The Lexical Tools provide lexical information for language processing, such as identifying string variants, and supply normalized string indexes to the Metathesaurus.
As of May 6, 2019, the 2019AA release of the UMLS Metathesaurus contains
approximately 3.85 million biomedical and health-related concepts and 14.6 mil-
lion concept names from 210 source vocabularies including the National Center
for Biotechnology Information (NCBI) taxonomy, Systematized Nomenclature of
Medicine - Clinical Terms (SNOMED CT), Gene Ontology, the Medical Subject
Headings (MeSH), and OMIM⁴.
1.1 Construction of the UMLS Metathesaurus
The current approach of building the Metathesaurus relies on the use of lex-
ical knowledge, semantic pre-processing, and UMLS human editors. The core
³ https://www.nlm.nih.gov/research/umls/index.html
⁴ https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/notes.html
idea is that synonymous terms originating from different source vocabularies are
clustered into a concept with a preferred term and a Concept Unique Identifier
(CUI). The basic building block of the Metathesaurus, also known as an "atom",
is a concept string from each of the source vocabularies. Simply put, each occur-
rence of a string in each source vocabulary is assigned a unique atom identifier
(AUI). When a lexically identical string appears in multiple source vocabularies, for example "Headache" appearing in both MeSH and ICD-10, each occurrence is assigned a different AUI. These AUIs are then linked to a single string identifier (SUI) to
represent occurrences of the same string. Each SUI is linked to all of its En-
glish lexical variants (detected using the Lexical Variant Generator tool) by a
common term identifier (LUI). These LUIs may subsequently be linked to more than one CUI because strings that are lexical variants of each other may have different meanings. Table 1 illustrates how synonymous terms are clustered into a CUI.
Table 1. Metathesaurus AUI, SUI, LUI, and CUI

| String (Source) | AUI | SUI | LUI | CUI |
|---|---|---|---|---|
| Headache (MeSH) | A0066000 | S0046854 | L0018681 | C0018681 |
| Headache (ICD-10) | A0065992 | S0046854 | L0018681 | C0018681 |
| Headaches (MedDRA) | A0066007 | S0046855 | L0018681 | C0018681 |
| Headaches (OMIM) | A12003304 | S0046855 | L0018681 | C0018681 |
| Cephalodynia (MeSH) | A0540936 | S0475647 | L0380797 | C0018681 |
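The identifier chain in Table 1 can be sketched as plain dictionaries (an illustrative toy, not the actual Metathesaurus data structures; all names here are hypothetical):

```python
# Each occurrence of a string in a source vocabulary gets its own AUI.
aui_to_sui = {
    "A0066000": "S0046854",   # Headache (MeSH)
    "A0065992": "S0046854",   # Headache (ICD-10): same string, same SUI
    "A0066007": "S0046855",   # Headaches (MedDRA)
    "A12003304": "S0046855",  # Headaches (OMIM)
    "A0540936": "S0475647",   # Cephalodynia (MeSH)
}
# Lexical variants of the same term share a LUI.
sui_to_lui = {"S0046854": "L0018681", "S0046855": "L0018681", "S0475647": "L0380797"}
# Synonymous terms are clustered under one CUI.
lui_to_cui = {"L0018681": "C0018681", "L0380797": "C0018681"}

def cui_of(aui: str) -> str:
    """Resolve an atom to its concept by chaining AUI -> SUI -> LUI -> CUI."""
    return lui_to_cui[sui_to_lui[aui_to_sui[aui]]]
```

Under this sketch, all five AUIs in Table 1 resolve to the single concept C0018681.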
In addition, some source vocabularies provide source synonyms, hierarchical and non-hierarchical relationships, as well as metadata for semantic pre-processing. The UMLS human editors associate concepts and perform manual reviews [1]. This process of constructing and inserting new resources into the existing Metathesaurus, from identifying lexical variants to manual audits by domain experts, can be both arduous and time-consuming given that the Metathesaurus currently comprises over 3.85 million concepts.
Given the recent successes of supervised Deep Learning (DL) approaches in their
applications to the medical and healthcare domains [2], we hypothesize that these
DL models can be trained to emulate the current building process.
1.2 Supervised Deep Learning
Supervised DL learns a function that maps an input to an output based on examples of input-output pairs through layers of dense networks [3]. The Metathesaurus comprises approximately 10 million English atoms, each assigned a CUI. One could simply train a supervised classifier to predict which CUI should be assigned to a "new" atom (since atoms having the same CUI are synonymous) as an approach to inserting new resources into the current Metathesaurus. However, this approach is considered an extreme classification task [4] due to the huge prediction space of 3.85 million CUIs. Nonetheless, the CUI is
merely a "mechanism" to cluster synonymous terms under the same "bucket". We are primarily interested in whether two atoms are synonymous and hence should be labeled with the same CUI, regardless of whether that CUI already exists in the Metathesaurus. Hence, this project is modeled as a similarity task where we assess similarity based not only on the lexical features of an atom but also on its context (represented by the lexical features of neighboring concepts in its source vocabulary). Concretely, a fully-trained model should
identify and learn scenarios where
1. Atoms that are lexically similar in nature but are not synonymous, e.g.,
"Lung disease and disorder" versus "Head disease and disorder"
2. Atoms that are lexically dissimilar but are synonymous, e.g., "Addison’s
disease" versus "Primary adrenal deficiency"
Similarity assessment between words and sentences, also known as the Semantic Text Similarity (STS) task, is an active research area in Natural Language Processing (NLP) due to its crucial role in various downstream tasks such as information retrieval, machine translation, and, in our case, synonym clustering. The STS task can be expressed as follows: given two sentences, a system returns a score between 0 and 1 indicating their degree of similarity. STS is a challenging
task due to the inherent complexity in language expressions, word ambiguities,
and variable sentence lengths. Traditional approaches rely on hand-engineered lexical features (e.g., word overlap and subwords [5], syntactic relationships [6], structural representations [7]), linguistic resources (e.g., corpora), and bag-of-words and term frequency-inverse document frequency (TF-IDF) models that incorporate a variety of similarity measures [8], for example string-based [9] and term-based [10]. However, most are syntactically and semantically constrained. Recent
successes in STS [11] in predicting sentence similarity and relatedness have been obtained by using corpus-based [12] and knowledge-based similarity, e.g., word embeddings for feature representation [13], together with supervised DL approaches, e.g., a Siamese Network with Recurrent Neural Networks (RNN) [14] or Convolutional Neural Networks (CNN) [15], to perform deep analysis of words and sentences and learn the necessary semantics and structure.
1.3 Siamese Recurrent Architecture
Contrary to the traditional neural network which takes in one input at a time,
the Siamese network is an architecture that takes in a pair of inputs and learns
representations based on the explicit similarity and dissimilarity information (i.e.
the pair of similar and dissimilar inputs) [16]. It was originally used for signa-
ture verification [16] and has since been applied to various applications such as
face verification [17], unsupervised acoustic modeling [18], and learning semantic
entailment [14] as well as text similarity [19]. A series of DL models can be incorporated within the Siamese architecture. The RNN is a type of DL model that excels at processing sequential information due to the presence of a memory cell that stores and "remembers" data read over time [20]. A variant of the RNN is the Long
Short-Term Memory (LSTM). It enhances the standard RNN to handle long-
term dependencies and to minimize the inherent vanishing gradient problem of
RNN with the introduction of "gates" (input, output and forget gates) to control
the flow of information and retain it better through time. It is more accurate in handling long sequences; however, this comes at the cost of higher memory consumption and slower training times compared to the standard RNN, which is faster but less accurate. Nonetheless, combinations of the Siamese network with RNN and LSTM have been applied to various NLP tasks, including similarity assessment, with great success [14,21,22]. On the other hand, CNN (another type of DL
model) has also performed well in NLP due to its ability to extract distinctive
features at a higher granularity [23]. A Siamese CNN model learns sentence em-
bedding and predicts sentence similarity with features from various convolution
and pooling operations [24].
In this paper, we explore the use of DL, specifically the Siamese recurrent
architecture with a combination of LSTM and CNN for the following contribu-
tions:
1. Identify synonymy and non-synonymy among English UMLS concepts at
the atom level (i.e. given two English atoms, are they synonymous and thus
belong to the same CUI?)
2. Investigate whether the DL approach could emulate the current Metathe-
saurus building process
2 Methodology
The scope of this project can be divided into four components: (i) retrieving
and parsing the UMLS dataset, (ii) generating features for learning, (iii) de-
signing the Siamese architecture, and (iv) evaluating the Siamese network with
different data enrichment strategies (i.e., infusing various knowledge pro-
vided by the source vocabularies). The UMLS dataset used in this study can be
retrieved with a UMLS license at https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html.
2.1 Dataset
We use the active subset of the 2019AA UMLS and remove the derivative, duplicative, and spelling-variant sources. The final dataset consists of 9,533,853 atoms grouped into 3,793,516 CUIs. Table 2 shows the sources removed.
2.2 Feature Engineering
The goal is to learn the similarities between pairs of atoms within a CUI and
dissimilarities between pairs of atoms from different CUIs. Prior to generating
the positive and negative pairs, we preprocess the lexical features of the atoms
Table 2. Sources Removed

Derivative and Duplicative: NCI_BRIDG, NCI_BioC, NCI_CDC, NCI_CDISC, NCI_CDISC-GLOSS, NCI_CPTAC, NCI_CRCH, NCI_CTCAE, NCI_CTCAE_3, NCI_CTCAE_5, NCI_CTEP-SDC, NCI_CTRP, NCI_CareLex, NCI_DCP, NCI_DICOM, NCI_DTP, NCI_EDQM-HC, NCI_FDA, NCI_GAIA, NCI_GENC, NCI_ICH, NCI_INC, NCI_JAX, NCI_KEGG, NCI_NCI-GLOSS, NCI_NCI-HGNC, NCI_NCI-HL7, NCI_NCPDP, NCI_NICHD, NCI_PI-RADS, NCI_PID, NCI_RENI, NCI_UCUM, NCI_ZFin, HCDT, HCPT, ICPC2P, LCH_NW

Spelling Variants: ICD10AE, ICD10AMAE, MTHICPC2EAE, MTHICPC2ICD10AE
similar to how [25] preprocess their dataset (removing all punctuation except hyphens, lowercasing, and tokenizing by space) to ensure conformity, as we leverage their pre-trained BioWordVec embedding in our downstream network (Section 2.4).
Synonyms. We generate positive pairs based on CUI-asserted synonymy be-
tween atoms. Table 3 shows examples of positive pairs generated from one CUI.
Non-Synonyms. On the contrary, it is computationally infeasible, in both time and space, to generate all negative pairs: pairing each of the approximately 9.5 million atoms against all atoms from non-related CUIs is on the order of 9.5 million squared. In addition, the class imbalance between positive and negative pairs would induce a learning bias in which the model suffers lower precision in detecting synonyms due to a higher preference towards non-synonyms.
Intuitively, we want the DL model to learn interesting negative pairs that are
lexically similar but differ in semantics. Hence, we adopt a heuristic approach to reduce the sample space: we compute the Jaccard index between atoms and include only negative pairs from different CUIs with high Jaccard similarity (with a cut-off threshold of 0.6) (Table 4). The pairs are then sorted
from the highest to lowest Jaccard index and the number of inclusion pairs is
shown in Table 5. The final dataset consists of pairs of strings sampled in a
1:1, 3:1, 4:1, 6:1, and 10:1 ratio of between-CUI (negative) pairs to within-CUI
(positive) pairs. These ratios are adopted from [18,19] for Siamese networks.
JaccardIndex(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (1)
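The filtering heuristic can be sketched as a token-level Jaccard computation over whitespace-tokenized atoms (function names here are illustrative, not from the paper's code):

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index between two atom strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def is_candidate_negative(a: str, b: str, threshold: float = 0.6) -> bool:
    """Keep only lexically similar pairs from different CUIs as negatives."""
    return jaccard(a, b) >= threshold

# The Table 4 example: intersection {product, containing, acid} = 3 tokens,
# union = 5 tokens, so the Jaccard index is 3/5 = 0.6.
pair = ("product containing para-aminobenzoic acid",
        "product containing sulfuric acid")
```

This pair sits exactly at the 0.6 cut-off and is therefore retained as an "interesting" negative example, while lexically unrelated pairs fall well below the threshold.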
2.3 Experiments
The entry point of our experiment is the lexical features of an atom. However, in
order to disambiguate concepts with lexically identical atoms, e.g., the concept "nail" with CUIs "C0222001" and "C0021885" shown in Figure 1, there is a need to contextualize the two different "nail" concepts (denoted by two distinct CUIs)
Table 3. Positive Pairs from a Single CUI

Atoms under CUI C0001403:
- Addison disease
- Primary hypoadrenalism
- Primary adrenocortical insufficiency
- Addison's disease (disorder)

Positive pairs generated:

| Atom 1 | Atom 2 |
|---|---|
| Addison disease | Primary hypoadrenalism |
| Addison disease | Primary adrenocortical insufficiency |
| Addison disease | Addison's disease (disorder) |
| Primary hypoadrenalism | Primary adrenocortical insufficiency |
| Primary hypoadrenalism | Addison's disease (disorder) |
| Primary adrenocortical insufficiency | Addison's disease (disorder) |
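Positive-pair generation amounts to enumerating all unordered pairs of atoms within each CUI; a minimal sketch (atom values taken from Table 3; the data structure is illustrative):

```python
from itertools import combinations

# One CUI mapped to its atoms, mirroring Table 3.
cui_atoms = {
    "C0001403": [
        "Addison disease",
        "Primary hypoadrenalism",
        "Primary adrenocortical insufficiency",
        "Addison's disease (disorder)",
    ]
}

def positive_pairs(cui_atoms):
    """Yield every unordered pair of atoms sharing a CUI."""
    for atoms in cui_atoms.values():
        yield from combinations(atoms, 2)

pairs = list(positive_pairs(cui_atoms))  # 4 atoms -> C(4, 2) = 6 pairs
```

For a CUI with n atoms this yields n(n−1)/2 pairs, which matches the six pairs shown in Table 3.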
Table 4. Jaccard Computation on a Pair of Atoms from Different CUIs

| CUI | Atom |
|---|---|
| C0000473 | Product containing para-aminobenzoic acid |
| C0038784 | Product containing sulfuric acid |

Jaccard Index = Intersection (3) / Union (5) = 0.6
Table 5. Final Dataset Size

| Feature | Number of Pairs |
|---|---|
| Synonyms (within-CUI positive pairs) | 15,647,133 |
| Non-synonyms at 1:1 ratio | 15,647,133 |
| Non-synonyms at 3:1 ratio | 46,941,399 |
| Non-synonyms at 4:1 ratio | 62,588,532 |
| Non-synonyms at 6:1 ratio | 93,882,798 |
| Non-synonyms at 10:1 ratio | 156,471,330 |

Ratios are of between-CUI non-synonym pairs to within-CUI synonym pairs.
with additional features/knowledge that indicate different meanings. Hence, we compose the experiments (Table 6) with different data enrichment strategies, i.e., infusing various knowledge that reflects the information available to the UMLS editors during manual construction of the Metathesaurus, including source synonymy, hierarchical context, and source semantic group.
Table 6. Five Experimental Setups

| Experiment | Features |
|---|---|
| 1 | Base Atom Lexical Features |
| 2 | Base + Source Synonymy |
| 3 | Base + Hierarchical Context + Semantic Group |
| 4 | Base + Source Synonymy + Hierarchical Context + Semantic Group |
| 5 | Base + Source Synonymy + Hierarchical Context + Hierarchical Source Synonymy + Semantic Group |
Base. The base consists of only the lexical features of an atom for all synonym
(positive) and non-synonym (negative) pairs.
Source synonymy. Some source vocabularies provide synonyms to the atoms
which enrich the original atom with additional lexical features that are synony-
mous. We generate these source synonyms based on the Source Concept Unique
Identifier (SCUI) of each atom.
Hierarchical context. Some source vocabularies provide hierarchical relation-
ships (ancestor-descendant or parent-child or broader-narrow relations) which
extend the original atom with surrounding contexts. We generate the hierarchi-
cal context using the unique lexical features of immediate (1-level) parents and
children based on the source relations.
Semantic group. The semantic group provides an additional layer of high-level
semantic categorization to an atom. Figure 1 shows that the two "nail" concepts are syntactically similar but differ in semantics: one refers to "anatomy" and the other to "devices". We assign the semantic group based on the
second-level concept from the root node of the original atom as a proxy for semantic categorization. For source vocabularies that do not provide hierarchical relationships, we assign a semantic group based on the human editors' best knowledge of the source of these atoms.
[Figure 1 depicts two lexically identical "nail" atoms being disambiguated: CUI C0222001 under the Anatomy semantic group, with source synonyms "nails, fingernails, toenails" and hierarchical contexts "malformed nail" and "dystrophic nail" (each with its own source synonyms), and CUI C0021885 under the Devices semantic group, with the hierarchical context "intramedullary nail".]

Fig. 1. Concept Disambiguation. The dotted brown boxes indicate source synonymy, the green boxes indicate hierarchical contexts, and the dotted purple boxes indicate the source semantic group.
2.4 Siamese Models
Two different Siamese Models are designed: the Siamese LSTM and Siamese
CNN-LSTM.
Siamese LSTM. This model adopts the Siamese structure from [14] (Figure 2). A pair of atoms is first transformed into their respective numerical word representations, i.e., embeddings of word vectors. A word embedding is a language modeling and feature learning technique in NLP where words are mapped to vectors of real numbers with varying dimensions. These word vectors are positioned in the vector space in a manner where words that share similar contexts in the corpus are situated close to one another [26]. Instead of training the word vectors from scratch, we leverage the pre-trained biomedical word embedding (BioWordVec-intrinsic), with a dimension of 200 per word vector, trained on the PubMed text corpus and MeSH data [25]. The rationale is to "precondition" the Siamese network with prior knowledge of the inherent similarity between words in the UMLS vocabulary. A word-length distribution shows that approximately 97% of atoms in the UMLS have a word length of at most 30. Hence, we pad or truncate each atom to a maximum length of 30 words to ensure uniform dimensions and speed up training. The embeddings of the pair of atoms are fed to LSTM_A and LSTM_B, each of which processes one of the atoms in the given pair and consists of 50 hidden learning units. These units learn the specific semantic and syntactic features based on the word order of each individual atom through time. The output of the model is a Manhattan distance similarity function, exp(−||LSTM_A − LSTM_B||₁) ∈ [0, 1], which is well-suited for high-dimensional spaces [27]. We apply this model to Experiment 1.
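The output similarity can be sketched in NumPy (an illustrative stand-in: in the actual model the two 50-dimensional inputs would be the final hidden states of the LSTM_A and LSTM_B branches):

```python
import numpy as np

def manhattan_similarity(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """exp(-||h_a - h_b||_1): equals 1.0 for identical encodings and
    decays toward 0 as the encodings diverge."""
    return float(np.exp(-np.abs(h_a - h_b).sum()))

h_a = np.zeros(50)                              # stand-in for one branch's output
identical = manhattan_similarity(h_a, h_a)      # identical encodings -> 1.0
distant = manhattan_similarity(h_a, h_a + 1.0)  # exp(-50), effectively 0
```

Because the exponential of a non-positive L1 distance is always in (0, 1], this output can be trained directly against 0/1 synonymy labels with a mean squared error loss, as done in the experiments.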
[Figure 2 sketch: each branch truncates/pads its atom to 30 words, applies the shared BioWordVec embedding (used to "precondition" the model with the inherent similarity between words), and feeds it to a shared LSTM; a Manhattan distance similarity function yields a similarity between 0 and 1, illustrated with the pair "Lung disease and disorder" / "Head disease and disorder".]

Fig. 2. The Siamese LSTM Model. Both the left and right branches of the model share the same weights across all layers.
Siamese CNN-LSTM. We use this model for Experiments 2, 3, 4, and 5 to infuse the additional knowledge and features: source synonymy, hierarchical context, and semantic group information. This model adopts the Siamese structure from [28] (Figure 3). It differs from the first architecture in its hidden learning layers. Instead of having only one embedding from the lexical features of the atoms, we concatenate two extra vectors, learned from embeddings that represent the extra context information, to the original atom vector. To generate the "context bag", we extract up to 60 unique lexical features from source synonyms and/or hierarchical context to enrich the base features of an atom, and sort them alphabetically to minimize word-order randomness (word order is deprioritized here) prior to transforming them into a context embedding.
We apply one layer of CNN with 100 filters and a window size of 5 [28] with
batch normalization (to reduce overfitting) to extract and generate an interme-
diary representation and subsequently apply a layer of LSTM with 50 hidden
learning units to learn these features. Similarly, the semantic group information
is "infused" by transforming it using BioWordVec embedding and subsequently
feeding it to a layer of LSTM with 50 hidden units. The outputs of each LSTM layer (base, context, and semantic group) are averaged over time, and these three 50-dimensional vectors are concatenated and used as input to a 2-layer dense Fully Connected (FC) network with 128 and 50 learning units respectively, with a Manhattan distance similarity function, exp(−||FC_A − FC_B||₁) ∈ [0, 1], as the final output layer. The parameters of both models are optimized using the Adam method [29].
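The context-bag construction can be sketched as follows (a minimal approximation; the synonym and context values echo the "nail" example in Figure 1, and the function name is hypothetical):

```python
def context_bag(source_synonyms, hierarchical_context, max_tokens=60):
    """Collect unique tokens from source synonyms and hierarchical context,
    sorted alphabetically to damp word-order randomness, capped at max_tokens."""
    tokens = set()
    for phrase in list(source_synonyms) + list(hierarchical_context):
        tokens.update(phrase.split())
    return sorted(tokens)[:max_tokens]

bag = context_bag(
    source_synonyms=["nails", "fingernails", "toenails"],
    hierarchical_context=["malformed nail", "dystrophic nail", "nail dystrophy"],
)
```

The resulting alphabetized token list is what gets mapped through the BioWordVec embedding and the Conv1D/LSTM context branch.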
[Figure 3 sketch: for each branch, the base atom (up to 30 words), the context bag (up to 60 unique, alphabetically sorted lexical features drawn from source synonyms and hierarchical context), and the semantic group are each mapped through 200-dimensional BioWordVec embeddings; the context embedding passes through a Conv1D layer to extract word features before its LSTM, while the base and semantic-group embeddings go directly to LSTMs that learn word order and features; the resulting "Base", "Context", and "SG" vectors are concatenated and fed through Fully Connected layers of 128 and 50 units to decide whether the two contextualized atoms are similar.]

Fig. 3. The Siamese CNN-LSTM Model. Similarly, both the left and right branches of the model share the same weights across all layers.
Each experiment (Experiments 1-5) is trained independently against five proportions (1:1, 3:1, 4:1, 6:1, and 10:1) of negative to positive pairs for 20 epochs and validated with 5-fold cross-validation on the Biowulf Cluster of the National Institutes of Health (NIH) High-Performance Computing (HPC) Systems, using a mix of Nvidia Tesla P100 and V100 graphics processing units. A set of experiments was conducted beforehand on a small dataset (training and validation sizes of 100,000 and 20,000 respectively) to gauge the performance and desired capabilities of the models and to fine-tune the hyper-parameters over different incremental ranges (e.g., learning rate from 0.0005 to 0.001, batch size from 128 to 512). Table 7 summarizes the final set of parameters and hyper-parameters used for the Siamese LSTM (baseline Experiment 1) and the Siamese CNN-LSTM (enriched Experiments 2, 3, 4, and 5) respectively.
3 Results and Evaluations
We evaluate the performance of the models in terms of validation accuracy,
precision, recall, overall F1-Score, specificity, sensitivity, and false-positive rate.
Out of all the proportions of negative to positive pairs, the 6:1 ratio achieves the best performance in terms of validation accuracy in identifying and classifying synonyms and non-synonyms. Table 8 shows the full performance metrics achieved by the 6:1 ratio of negative to positive pairs, and Table 9 shows examples of true positives and true negatives correctly identified, false positives identified, and false negatives not identified by Experiment 5.
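As a sanity check on the reported metrics, the F1-score is the harmonic mean of precision and recall; for example, Experiment 2's precision (0.8654) and recall (0.8874) from Table 8 reproduce its reported F1-score of 0.8763. A minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Experiment 2 in Table 8: precision 0.8654, recall 0.8874 -> F1 of 0.8763
f1_exp2 = round(f1_score(0.8654, 0.8874), 4)
```

The same computation recovers the other F1 entries in Table 8 from their precision and recall columns.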
Table 7. The Set of Parameters Used for the Siamese LSTM and Siamese CNN-LSTM Respectively

| Parameter/Hyperparameter | Siamese LSTM | Siamese CNN-LSTM |
|---|---|---|
| Framework | Keras 2.0 with Tensorflow backend | Keras 2.0 with Tensorflow backend |
| Word Vector Size | 200 | 200 |
| Maximum Input Length | 30 | 30 |
| Maximum Context Input Length | - | 60 |
| Embedding | BioWordVec | BioWordVec |
| LSTM Hidden Units | 50 | 50 |
| LSTM Activation | Tanh | Tanh |
| CNN Filters | - | 100 |
| CNN Window Size | - | 5 |
| CNN Activation | - | ReLU with batch normalization |
| Fully Connected Layer 1 | - | 128 units with ReLU activation |
| Fully Connected Layer 2 | - | 50 units with ReLU activation |
| Weights and Biases | Random initialization | Random initialization |
| Optimizer | Adam | Adam |
| Learning Rate | 0.001 | 0.001 |
| Loss Function | Mean Squared Error (MSE) | Mean Squared Error (MSE) |
| Batch Size | 128 | 128 |
| Number of Training Epochs | 20 | 20 |
| Validation | 5-fold cross-validation | 5-fold cross-validation |
Table 8. Performance of the 6:1 Ratio of Negative to Positive Pairs

| Performance Metric | Exp. 1 (Base) | Exp. 2 (Base + SS) | Exp. 3 (Base + HC + SG) | Exp. 4 (Base + SS + HC + SG) | Exp. 5 (Base + SS + HC + HSS + SG) |
|---|---|---|---|---|---|
| Accuracy | 0.9333 | 0.8720 | 0.9486 | 0.9520 | 0.9541 |
| Precision | 0.7828 | 0.8654 | 0.7643 | 0.8296 | 0.8009 |
| Recall | 0.7379 | 0.8874 | 0.8381 | 0.9038 | 0.8978 |
| F1-Score | 0.7597 | 0.8763 | 0.7995 | 0.8428 | 0.8466 |
| Specificity | 0.9659 | 0.8560 | 0.9640 | 0.9601 | 0.9633 |
| Sensitivity | 0.7379 | 0.8874 | 0.8381 | 0.9038 | 0.8978 |
| False Positive Rate | 0.0341 | 0.1440 | 0.0360 | 0.0399 | 0.0367 |

Exp. 1 uses the Siamese LSTM; Exp. 2-5 use the Siamese CNN-LSTM. Exp.: Experiment, SS: Source Synonymy, HC: Hierarchical Context, SG: Semantic Group, HSS: Hierarchical Source Synonymy
Table 9. Examples of True Positives, True Negatives, False Positives, and False Negatives from Experiment 5

True Positives (Synonyms) Correctly Identified

| nail clipper | cutters nail |
| injury of salivary gland | salivary gland injury |
| avulsion | fracture sprain |

True Negatives (Non-synonyms) Correctly Identified

| fingernail | infection of fingernail |
| product containing only iron medicinal product | product containing only levorphanol medicinal product |
| medical and surgical gastrointestinal system insertion ileum via natural or artificial opening endoscopic infusion device | medical and surgical gastrointestinal system revision stomach via natural or artificial opening endoscopic other device |

False Positives (Non-synonyms) Identified

| finding of wrist joint | finding of knee joint |
| malignant neoplasm of upper limb | malignant neoplasm of muscle of upper limb |
| skin wound of axillary fold | skin cyst of axillary fold |

False Negatives (Synonyms) Not Identified

| hla antigen | human leukocyte antigen |
| pyelotomy | incision of renal pelvis treatment |
| routine cervical smear | screening for malignant neoplasm of cervix |
4 Discussion
Based on Table 8, we observe that using only the lexical features of an atom yields an overall F1-score of 75.97%. Infusing source synonymy into the base yields a higher precision of 86.54% and an overall F1-score of 87.63%. Infusing hierarchical context, by contrast, trades precision for a higher recall of 90.38%. Infusing source synonymy, hierarchical context, and the semantic group together boosts overall accuracy to 95.20%. However, infusing the source synonymy of the hierarchical context does not yield any noticeable improvement. A plausible explanation is that synonyms provided by the source are closely related alternative variants of the base atom, hence the higher precision, whereas hierarchical contexts, i.e., parent and child relationships, represent broader and narrower relations that bring a wider variety of lexical features to the base atom, hence the higher recall. However, extending the hierarchical context to include the source synonymy of the parent and child atoms may overstretch the original semantics of the base atom, and the model may perceive them as noise.
Based on Table 9, we can observe the performance of the trained Siamese model
from Experiment 5 on real-world examples. With the incorporation of the LSTM,
the model handles both short and long sequences and learns positional variants of
the atoms, e.g., "injury of salivary gland" versus "salivary gland injury". Combined
with the CNN, the model learns to separate pairs that are lexically similar but not
synonymous, e.g., "product containing only iron medicinal product" versus "product
containing only levorphanol medicinal product", and, conversely, to match atoms
that are lexically dissimilar but synonymous, e.g., "avulsion" versus "fracture
sprain". Nonetheless, for words that are semantically close, such as "wrist" and
"knee", or "wound" and "cyst", the model fails to recognize the pairs as
non-synonyms. In addition, the model fails to identify synonyms with rare lexical
features, such as "pyelotomy", which indicates that there is still room for
fine-tuning, e.g., expanding the capacity of the current architecture to learn from
more examples.
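The similarity head of such a Siamese architecture is often the Manhattan-distance form exp(-||h_a - h_b||_1) over the two encoder outputs, as in the Siamese recurrent architectures of Mueller and Thyagarajan. The sketch below uses that similarity with a toy bag-of-words encoder standing in for the trained LSTM/CNN; note that the toy encoder is order-invariant by construction, whereas in the actual model the robustness to positional variants is learned.

```python
import math
from collections import Counter

def toy_encode(atom):
    """Toy bag-of-words encoding; a stand-in for the trained LSTM/CNN encoder."""
    return Counter(atom.lower().split())

def siamese_similarity(a, b):
    """Manhattan-distance similarity exp(-||h_a - h_b||_1), bounded in (0, 1]."""
    ha, hb = toy_encode(a), toy_encode(b)
    l1 = sum(abs(ha[w] - hb[w]) for w in set(ha) | set(hb))
    return math.exp(-l1)

# Positional variants score high; lexically unrelated atoms score low.
sim_pos = siamese_similarity("injury of salivary gland", "salivary gland injury")
sim_neg = siamese_similarity("salivary gland injury",
                             "product containing only iron")
```

A thresholded version of this score is what the synonymy decision reduces to at inference time; the quality of the decision rests entirely on how well the learned encoder separates the two classes.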
5 Conclusion
In conclusion, this study demonstrates the feasibility of using DL to identify
synonymy and non-synonymy among atoms with relatively good performance,
indicating a promising potential for emulating the current Metathesaurus building
process. In addition, a knowledge-infused DL approach leveraging multiple streams
of knowledge provides the necessary contextualization to disambiguate lexically
identical features and achieves overall higher performance than a vanilla DL
approach. Future work includes (a) evaluation against the manual rule-based
normalization process of constructing the Metathesaurus, since the current
evaluations are confined to the scope of DL, i.e., they assess whether infusing
additional knowledge (features) provides better performance, but not how the
automatic process compares with the traditional building process; and (b)
investigation of the scalability, maintenance, and applicability of these models to
complement the current lexical processing and the UMLS human editors.
6 Acknowledgment
This work was supported by the Intramural Research Program of the NIH, Na-
tional Library of Medicine. This research was also supported in part by an ap-
pointment to the National Library of Medicine Research Participation Program.
This program is administered by the Oak Ridge Institute for Science and Educa-
tion through an inter-agency agreement between the U.S. Department of Energy
and the National Library of Medicine.