Self-learning ontological concept representation for
searching and matching tasks
Duy-Hoa Ngo, Bevan Koopman
The Australian e-Health Research Centre, CSIRO


Abstract
Ontology searching and matching can be achieved by learning vector-based representations of ontology
concepts. Choosing appropriate features is a key step in learning good concept representations. Lexical
annotations and structural features are complementary in representing a concept, but how to combine
them effectively is still an open question. To address this problem, we propose a method that
self-generates training data from the input ontologies, together with corresponding deep neural network
models that encode ontological concepts as embedding vectors. Our experiments on biomedical ontologies
(SNOMED CT vs. HPO) showed that using semantic embedding features can increase searching effectiveness
by over 20% compared with baseline methods. Our approach provides a generic method for both ontology
searching and matching.

Keywords
Machine learning, Label embedding, Transformers, Semantic similarity, Ontology matching


1. Introduction
Concept matching and searching are fundamental for tasks such as text data annotation,
integration and analysis. Concept matching involves matching a concept in one ontology to its
equivalent in another. In the health domain, matching facilitates interoperability, allowing
healthcare centres to collaborate by sharing and integrating their health data resources for
intensive analysis. Concept searching, in contrast, involves finding the relevant concept in an
ontology given some free-text input extracted from sources such as patient medical records,
publications, reports or news about health care. Searching allows unstructured text to be mapped
to a standard coding system to reduce ambiguity and enable machine understanding through
semantic expressions.
   Ontology matching (OM) tools aim to automate the laborious human task of concept matching
between ontologies. While OM tools continue to make improvements, there is still a gap between
automated OM methods and human terminology experts [1]. Typically, the produced mappings
are suggested as the best results that the OM tool can find, but nothing guarantees that they are
all correct or complete. Therefore, terminology experts are still needed to verify the matching
results by using different features of concepts such as lexical annotations (e.g., concepts’ labels,
descriptions) and structural information (relationship between concepts) to remove incorrect
and add missing mappings. Lexical and structural information are complementary to the concept
representation, but combining them effectively is still a hard challenge [2].

OM-2022: The 17th International Workshop on Ontology Matching, Hangzhou, China, October 23-27, 2022
Email: hoa.ngo@csiro.au (D. Ngo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

   To address this problem, we propose a method to learn semantic vector representations of
ontology concepts from their lexical and structural features, which can then be used with any
distance metric for searching and matching concepts across ontologies. In particular, we propose
a method that self-generates training data from the input ontologies, together with corresponding
deep neural network models that encode ontological concepts into semantic embedding vectors.


2. Methodology
Below we list some specific observations that inform our automated approach:

  (a) The matching step usually comes after the searching step, in which only a small set of
      concepts is selected as potential candidates for further analysis. A popular method applies
      a hash function to each concept’s string label to select concepts that share one or a few
      tokens. A drawback of this method is that it fails when concept labels share no common terms,
      e.g., Hypoplasia of lower limb vs. Rudimentary leg, which occurs frequently
      in biomedical ontologies.

  (b) A specific meaning of a word depends on the context where it stands with other words
      and the domain knowledge it belongs to.

  (c) Structural information presents the semantics of a concept through logical expressions with
      other concepts in the ontology. However, it is usually used to check the consistency of
      mapping candidates rather than to discover new mappings, because even the same concept
      may have different logical expressions in different ontologies.

  (d) Hierarchical information presents a degree of granularity, which can be used for estimating
      the distance of concepts in the same ontology.

  (e) A common pattern in defining a new concept in the biomedical domain is “Lexically suggest,
      logically define” [3], which means the logical expression of the concept can be inferred
      from the meaning of its names.
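
The limitation described in observation (a) can be illustrated with a minimal sketch. It assumes a simple token-based inverted index as the hashing scheme; the function names, the whitespace tokenisation and the second target label are hypothetical choices made for illustration, not details of any specific OM tool.

```python
from collections import defaultdict


def tokenize(label):
    """Lowercase a label and split it on whitespace (illustrative tokenisation)."""
    return set(label.lower().split())


def build_token_index(labels):
    """Map each token to the set of labels that contain it."""
    index = defaultdict(set)
    for label in labels:
        for token in tokenize(label):
            index[token].add(label)
    return index


def candidates(query, index):
    """Return every indexed label that shares at least one token with the query."""
    found = set()
    for token in tokenize(query):
        found |= index.get(token, set())
    return found


# "Fracture of lower limb" is a made-up distractor label for this example.
target_labels = ["Rudimentary leg", "Fracture of lower limb"]
index = build_token_index(target_labels)

# The intended match "Rudimentary leg" shares no token with the query,
# so token-based filtering never proposes it as a candidate.
result = candidates("Hypoplasia of lower limb", index)
print(result)
```

Only the lexically overlapping label is returned, which is the failure case that motivates embedding-based search.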

   These observations show the importance of lexical annotations in representing concepts,
but without understanding the meaning of words in their context, potential matching candidates
will be missed. A better concept representation is an embedding vector, which has shown promising
results in deep learning approaches for sentence embedding [4]. Therefore, we encode every
concept label in the ontologies into a semantic embedding vector. A concept representation is
then the average of the embedding vectors of all its labels.
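
As a minimal sketch of this averaging step, the snippet below computes a concept vector as the element-wise mean of its label vectors. The three-dimensional vectors are made-up placeholders; in practice they would come from the fine-tuned encoder, and `mean_vector` is a hypothetical helper name.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("a concept needs at least one label embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Placeholder embeddings for two labels of the same concept.
label_vectors = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 4.0],
]

concept_vector = mean_vector(label_vectors)
print(concept_vector)
```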
   On the other hand, the observations also show the importance of structural information, and
that the vocabulary used in an ontology is domain specific, so concept vector representations
should be learned directly from ontologies in the same domain. Additionally, since a distance
metric is used to measure the similarity of vector representations, the representations should be
learned so that concepts that are closely related in the ontology hierarchy are closer to each
other than unrelated concepts.

2.1. Deep neural network architecture
As the objective of learning a good concept representation is its use in similarity and distance
ranking tasks, Siamese [4] and Triplet [5] neural network models (see Fig. 1) were adopted for
training. In summary, these models use BERT [8], a transformer-based deep learning model for
natural language processing (NLP), with a pooling strategy to encode a string input into an
embedding vector.
   An input instance to the Siamese model consists of two strings and a similarity score. The
training objective is to fine-tune the parameters of BERT so that the cosine similarity of the
embedding vectors of the two input strings approximates the given similarity score. For
the Triplet model, an input instance consists of three strings: an Anchor string, a closely related
Positive string, and an unrelated Negative string. Its training objective is to fine-tune BERT so
that the distance between the embedding vectors of Anchor and Positive is less than the distance
between the embedding vectors of Anchor and Negative. Since there is no publicly available
machine learning data set for learning concept representations in these formats, we propose
algorithms to self-generate training data for the Siamese and Triplet models from the ontologies.
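
The two training objectives can be sketched as loss functions over already-computed embedding vectors. This is an illustration under stated assumptions, not the paper's implementation: the squared-error form of the Siamese loss, the Euclidean distance and the margin value of 1.0 in the triplet loss are common choices assumed here, since the text does not specify them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def siamese_loss(u, v, target_score):
    """Squared gap between the cosine similarity and the target score (assumed form)."""
    return (cosine_similarity(u, v) - target_score) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the Positive is closer to the Anchor than the Negative by `margin`."""
    gap = euclidean_distance(anchor, positive) - euclidean_distance(anchor, negative)
    return max(0.0, gap + margin)


# A well-ordered triplet incurs no loss; a reversed one is penalised.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

During fine-tuning, gradients of such losses would flow back through the pooling layer into the BERT parameters; the sketch only shows the quantities being minimised.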


Figure 1: Siamese and Triplet neural network architecture


2.2. Self-generated data from ontologies
To provide a training data set for the Siamese model, a similarity score is required for every
pair of string labels. The Lin measure [7] with ancestors’ subgraph-based intrinsic information
content (AsIIC) [6] was chosen because this combination showed the highest correlation with human
judgments in several widely used benchmarks.
   Generating training data for the Siamese model works as follows. First, the algorithm computes
the intrinsic information content (the AsIIC value) of every concept. Second, it randomly chooses
pairs of concepts and calculates their Lin similarity scores. Third, it assigns each pair’s
similarity score to all pairwise combinations of the annotated labels of the two concepts.
Additionally, if a concept has multiple preferred labels or synonymous labels, the algorithm
assigns a similarity score of 1.0 to each pair of those labels.
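
The steps above can be sketched as follows. The Lin measure is computed as twice the information content (IC) of the most informative common ancestor divided by the sum of the ICs of the two concepts. The AsIIC computation itself is not reproduced: the IC values, the toy ontology, its labels and all function names are hypothetical and supplied by hand for illustration.

```python
import itertools
import random


def ancestors(concept, parents):
    """Return the concept itself together with all of its ancestors."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin_similarity(c1, c2, parents, ic):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(c1) + IC(c2))."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    denominator = ic[c1] + ic[c2]
    if not common or denominator == 0:
        return 0.0
    return 2 * max(ic[c] for c in common) / denominator


def siamese_examples(parents, labels, ic, n_pairs, seed=0):
    """Build (label_a, label_b, score) training instances."""
    rng = random.Random(seed)
    concepts = sorted(parents)
    examples = []
    # Labels of the same concept are synonyms and receive score 1.0.
    for concept in concepts:
        for a, b in itertools.combinations(labels[concept], 2):
            examples.append((a, b, 1.0))
    # Randomly chosen concept pairs share their Lin score across all label pairs.
    for _ in range(n_pairs):
        c1, c2 = rng.sample(concepts, 2)
        score = lin_similarity(c1, c2, parents, ic)
        for a in labels[c1]:
            for b in labels[c2]:
                examples.append((a, b, score))
    return examples


# Toy ontology: child -> list of direct parents (all values are illustrative).
parents = {"limb_anomaly": [], "leg_anomaly": ["limb_anomaly"], "arm_anomaly": ["limb_anomaly"]}
labels = {
    "limb_anomaly": ["Abnormality of limb"],
    "leg_anomaly": ["Abnormality of leg", "Leg anomaly"],
    "arm_anomaly": ["Abnormality of arm"],
}
ic = {"limb_anomaly": 0.2, "leg_anomaly": 0.8, "arm_anomaly": 0.8}  # stand-in for AsIIC

examples = siamese_examples(parents, labels, ic, n_pairs=2)
for example in examples:
    print(example)
```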
   There are two methods for generating training data for the Triplet model: a “hard” method
and an “adaptive” method. The “hard” method applies two distance ranking rules, Synonym and
Parent-Child, to the hierarchical structure of the given ontology. The Synonym rule states that
two synonymous labels are the most similar pair, so they always take the place of the Anchor and
Positive inputs; any other concept label can serve as a Negative input. The Parent-Child rule
states that the distance from a child to its direct parent is always less than the distance from
this child to its siblings, uncles or grandparents. Thus, for each concept label, the algorithm
assigns it as the Anchor and randomly picks one of the concept’s parents as the Positive; the
Negative is randomly chosen from the concept’s siblings, uncles or grandparents.
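
A minimal sketch of the “hard” method is given below. It assumes the ontology is available as a child-to-parents mapping plus a list of labels per concept; the toy ontology, its labels and the helper names (`invert`, `relatives`, `hard_triplets`) are hypothetical. Emitting one synonym triplet per unordered label pair is a simplification chosen for brevity.

```python
import random


def invert(parents):
    """Derive a parent -> children mapping from a child -> parents mapping."""
    children = {concept: [] for concept in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].append(child)
    return children


def relatives(concept, parents, children):
    """Siblings, uncles and grandparents of a concept (Negative candidates)."""
    result = set()
    for parent in parents[concept]:
        result |= set(children[parent]) - {concept}          # siblings
        for grandparent in parents[parent]:
            result.add(grandparent)                          # grandparents
            result |= set(children[grandparent]) - {parent}  # uncles
    return result - set(parents[concept]) - {concept}


def hard_triplets(parents, labels, seed=0):
    """Generate (anchor, positive, negative) label triplets with both rules."""
    rng = random.Random(seed)
    children = invert(parents)
    concepts = sorted(parents)
    triplets = []
    for concept in concepts:
        others = [c for c in concepts if c != concept]
        # Synonym rule: two labels of one concept, Negative from any other concept.
        for i, anchor in enumerate(labels[concept]):
            for positive in labels[concept][i + 1:]:
                if others:
                    negative = rng.choice(labels[rng.choice(others)])
                    triplets.append((anchor, positive, negative))
        # Parent-Child rule: Positive from a parent, Negative from the relatives.
        negatives = sorted(relatives(concept, parents, children))
        if parents[concept] and negatives:
            for anchor in labels[concept]:
                positive = rng.choice(labels[rng.choice(parents[concept])])
                negative = rng.choice(labels[rng.choice(negatives)])
                triplets.append((anchor, positive, negative))
    return triplets


# Toy ontology (illustrative only).
parents = {"limb": [], "leg": ["limb"], "arm": ["limb"], "short_leg": ["leg"]}
labels = {
    "limb": ["Abnormality of limb"],
    "leg": ["Abnormality of leg", "Leg anomaly"],
    "arm": ["Abnormality of arm"],
    "short_leg": ["Short leg", "Hypoplastic leg"],
}

triplets = hard_triplets(parents, labels)
for triplet in triplets:
    print(triplet)
```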
   For the “adaptive” method, the Synonym rule is used to generate the Anchor and Positive
inputs as described above, while a pre-trained model is used to generate the Negative inputs.
Specifically, for each concept label, a pre-trained model is used to search for the most similar
labels among all concepts in the given ontology. Removing all synonymous labels of the query
label from the search results yields a list of Negative inputs. This forces the model to
become far more sensitive to what is actually a matching concept rather than merely a
related one.
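
The negative selection of the “adaptive” method can be sketched as a similarity search followed by synonym removal. The hand-written two-dimensional vectors below stand in for the output of a pre-trained model, and the labels, the `top_k` cut-off and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_negatives(query_label, embeddings, synonyms, top_k=3):
    """Return the top-k most similar labels with the query's synonyms removed."""
    query_vector = embeddings[query_label]
    ranked = sorted(
        (label for label in embeddings if label != query_label),
        key=lambda label: cosine_similarity(query_vector, embeddings[label]),
        reverse=True,
    )
    excluded = synonyms.get(query_label, set())
    return [label for label in ranked[:top_k] if label not in excluded]


# Placeholder embeddings; a real run would encode labels with the pre-trained model.
embeddings = {
    "Short leg": [1.0, 0.0],
    "Hypoplastic leg": [0.9, 0.1],
    "Short arm": [0.8, 0.3],
    "Long leg": [0.7, 0.5],
    "Headache": [0.0, 1.0],
}
synonyms = {"Short leg": {"Hypoplastic leg"}}

negatives = adaptive_negatives("Short leg", embeddings, synonyms)
print(negatives)
```

The surviving labels are close to the query in the current embedding space but are not synonyms, which is what makes them informative Negative inputs.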


3. Experiment and evaluation
Within the scope of this short paper, we only design experiments and perform evaluation for the
ontological concept searching task. For evaluation, we used a clinical concept alignment between
the Human Phenotype Ontology (HPO) and SNOMED CT (SCT)¹. It was manually created by terminology
experts in our terminology matching project and is far more complete than the mappings provided
in BioPortal, which are automatically generated by the LOOM algorithm². Our dataset contains
5,978 HPO concepts that have been matched to SCT concepts. These HPO concepts have 14,149
annotated labels in total.
   Two types of searching experiments were conducted as follows: 1) Label searching illustrates
a scenario where a terminologist wants to retrieve the best matched concept for a given query
string. 2) Concept searching illustrates a scenario where a terminologist wants to match
concepts from different ontologies for further data interoperability.
   First, we encoded all concepts in SCT into embedding vectors and indexed them into a
Non-Metric Space index that supports a fast approximate nearest neighbour search algorithm [9].
For searching, HPO concepts and their annotated labels are also encoded into embedding vectors,
which are used to query the index and return the top 𝐾 nearest neighbours. Recall that the vector
representation of a concept is the average of its labels’ embedding vectors.
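
The search step can be sketched as follows. Note that this sketch substitutes an exact brute-force cosine ranking for the approximate nearest neighbour index used in the experiments, which is only practical for small collections; the concept identifiers and vectors are made-up placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vector, index, k):
    """Return the identifiers of the k indexed concepts most similar to the query."""
    ranked = sorted(
        index,
        key=lambda concept_id: cosine_similarity(query_vector, index[concept_id]),
        reverse=True,
    )
    return ranked[:k]


# Placeholder target index: concept identifier -> concept embedding vector.
index = {
    "sct:leg": [1.0, 0.1],
    "sct:arm": [0.6, 0.8],
    "sct:head": [0.0, 1.0],
}

query_vector = [0.9, 0.2]  # placeholder embedding of a query label or concept
top_2 = search(query_vector, index, k=2)
print(top_2)
```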
   Table 1 shows the evaluation results for the different approaches. We use Hits@K (𝐾 = 1, 5, 10),
a common metric in information retrieval, to evaluate the searching performance of these
methods. For a given query, Hits@K is 1 if the relevant concept is found in the top 𝐾 results;
otherwise it is 0. The Baseline is a simple BM25 search model using the concept label as the
query. The Bio-BERT method uses a pre-trained BioBERT³ to encode a concept label into an
embedding vector. Siamese BERT and Triplet BERT do the same, but their BERT parameters have been
fine-tuned with training data self-generated from HPO and SCT. Continue Training I is a
continuous training pipeline that starts from pre-trained Bio-BERT, proceeds to Siamese BERT and
finishes with Triplet BERT; here, the data set for the Triplet BERT model is generated using the
“hard” method. Continue Training II starts from pre-trained Bio-BERT, trains Triplet BERT using
the “hard” method, and then continues training Triplet BERT using the “adaptive” method.

Table 1
Hits@K evaluation on concept and label searching
           HPO2SCT                     Label searching          Concept searching
           Query Size                  #labels=14149            #concepts=5978
           Hits@K                      K=1     K=5     K=10     K=1     K=5     K=10
           Baseline (BM25)             0.334   0.486   0.553    0.529   0.718   0.778
           Bio-BERT                    0.296   0.438   0.488    0.549   0.740   0.794
           Siamese BERT                0.616   0.830   0.888    0.728   0.902   0.937
           Triplet BERT                0.608   0.797   0.844    0.768   0.923   0.944
           Continue Training I         0.645   0.842   0.878    0.782   0.930   0.958
           Continue Training II        0.703   0.881   0.916    0.792   0.922   0.947

    ¹ HPO ver. 08.11.2019 vs. SCT ver. 29.02.2020
    ² https://bioportal.bioontology.org/ontologies/HP/?p=mappings

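
The Hits@K metric can be sketched in a few lines. The per-query definition follows the text; averaging it over all queries to obtain a single score per method is an assumption about how the reported values are aggregated, and the ranked lists below are made-up placeholders.

```python
def hits_at_k(ranked_results, relevant, k):
    """Return 1 if the relevant concept is among the top k results, else 0."""
    return 1 if relevant in ranked_results[:k] else 0


def mean_hits_at_k(queries, k):
    """Average Hits@K over (ranked_results, relevant_concept) pairs."""
    return sum(hits_at_k(results, relevant, k) for results, relevant in queries) / len(queries)


# Placeholder search output: ranked candidate identifiers and the expected match.
queries = [
    (["c1", "c2", "c3"], "c1"),  # hit at rank 1
    (["c4", "c5", "c6"], "c6"),  # hit at rank 3
    (["c7", "c8", "c9"], "c0"),  # relevant concept not retrieved
]

for k in (1, 3):
    print(k, mean_hits_at_k(queries, k))
```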
   Our proposed method improves Hits@K by more than 30% at the label level and by more than 20%
at the concept level in comparison with the Baseline and pre-trained Bio-BERT methods across all
Hits@K metrics. For example, label-level Hits@1 rises from 0.334 (Baseline) to 0.703 (Continue
Training II), and concept-level Hits@1 rises from 0.529 to 0.792. The experimental results also
show that without training on appropriate data, Bio-BERT does not consistently improve on the
Baseline (it is worse at the label level and only slightly better at the concept level), despite
the fact that Bio-BERT is a state-of-the-art language model achieving the best results in many
biomedical NLP tasks. Another interesting point is that Siamese BERT outperforms Triplet BERT at
the label level but performs worse at the concept level; after continuous training, however, the
final models achieved the best results at both levels. These experiments demonstrate the
importance of self-generated data from the input ontologies in learning ontological concept
representations, and they suggest that continuous training can enhance the concepts’ vector
representations and improve concept searching performance.


4. Conclusion
We present three methods of self-generating training data from ontologies. Siamese BERT and
Triplet BERT networks trained on these data learn concept representations that are used for
ontology searching and matching tasks. An empirical evaluation on two biomedical ontologies
(SNOMED CT & HPO) showed that these methods were effective at matching relevant ontology
concepts, outperforming existing BM25 and BERT baselines. These models represent a further step
toward fully automated ontology matching that does not require laborious manual effort by humans.


    ³ https://github.com/dmis-lab/biobert

References
[1] Euzenat, Jerome and Shvaiko, Pavel. Ontology Matching. Springer (2013).
[2] Ngo, DH and Bellahsene, Zohra. Overview of YAM++ - (not) Yet Another Matcher for
    ontology alignment task. J. Web Semantics (2016).
[3] Rector, Alan and Iannone, Luigi. Lexically Suggest, Logically Define: Quality Assurance of
    the Use of Qualifiers and Expected Results of Post-Coordination in SNOMED CT. Journal of
    Biomedical Informatics (2012).
[4] Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese
    BERT-Networks. EMNLP (2019).
[5] Ngo DH, Kemp M, Truran D, Koopman B, Metke-Jimenez A. Semantic Search for Large
    Scale Clinical Ontologies. AMIA (2021).
[6] Ben Aouicha M, Hadj Taieb MA. Computing semantic similarity between biomedical
    concepts using new information content approach. J Biomed Inform. (2016).
[7] Lin, Dekang. An Information-Theoretic Definition of Similarity. ICML (1998).
[8] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton and Toutanova, K. BERT: Pre-training of Deep
    Bidirectional Transformers for Language Understanding. NAACL (2019).
[9] Malkov, Yu A. and Yashunin, D. A. Efficient and Robust Approximate Nearest Neighbor
    Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach.
    Intell. (2020).