=Paper= {{Paper |id=Vol-2918/short3 |storemode=property |title=Ontology Clustering with OWL2Vec* |pdfUrl=https://ceur-ws.org/Vol-2918/short3.pdf |volume=Vol-2918 |authors=Ashley Ritchie,Jiaoyan Chen,Leyla Jael Castro,Dietrich Rebholz-Schuhmann,Ernesto Jiménez-Ruiz |dblpUrl=https://dblp.org/rec/conf/esws/RitchieCCRJ21 }} ==Ontology Clustering with OWL2Vec*== https://ceur-ws.org/Vol-2918/short3.pdf
               Ontology Clustering with OWL2Vec*

                 Ashley Ritchie¹, Jiaoyan Chen², Leyla Jael Castro³,
                Dietrich Rebholz-Schuhmann³, Ernesto Jiménez-Ruiz¹
             ¹ City, University of London, London, United Kingdom
             ² University of Oxford, United Kingdom
             ³ ZB MED Information Centre for Life Sciences, Cologne, Germany



       Abstract. In this work we present an exploratory study that applies OWL2Vec* to
       drive the clustering of ontology entities (i.e., ontology clustering). OWL2Vec*
       is a state-of-the-art system that creates embeddings, capturing the semantics of
       both entities and tokens that appear in the ontology.


1      Introduction
Vector embeddings are used to process large datasets as the vectorized space makes it
easier to find insights and combine data from multiple sources. Although initially
used for text, embeddings are also used to process graph-based structures such as
knowledge graphs (KGs) largely composed of facts. Quite a few KG embedding
techniques have been proposed and applied to entity clustering [1], but few have
considered ontologies which are more complicated. In this work, we present an
exploratory study using OWL2Vec* to drive the computation of ontology entity
clusters. OWL2Vec* [2] is a state-of-the-art system that creates embeddings for both
the entities and the lexical information that appears in an ontology. We have explored
327 embedding configurations by varying the hyperparameters of OWL2Vec*. These
embeddings were then analyzed by measuring the distances between each entity’s
embeddings as well as clustering the embeddings using the k-means algorithm.
   The motivation comes from OntoClue [3], which aims to improve information
retrieval (IR) for researchers by finding articles that are relevant, although not
necessarily similar. In a similar vein to topic modelling, where topics are found and assigned to
documents, OntoClue plans to use clusters emerging from ontology embeddings to
later assign topics to scholarly publications. To do so, embedding-based clusters will
be combined with text-mining and named entity recognition to find related articles
across multiple clusters. In this work we focus the experiments on the Gene Ontology
(GO) [4], an ontology that is structurally and lexically rich. Our main goal was
understanding how different embedding configurations and cluster sizes would affect
the resulting clustering. Our next step, outside the scope of this paper, will be using
such clusters to assign topics to a corpus of documents and therefore improve results
from IR.
2      Ontology Embedding
Machine learning models typically assume dense numeric representations, which
differ from how ontologies are expressed [5]. Therefore, ontology embedding is
used as a way to condense the rich semantic and structural information of an ontology
into a low-dimensional vector space that can be used downstream by machine
learning algorithms. A common method for ontology embedding is through the use of
word embedding models such as Word2Vec, which enables the computing of
continuous vector representations from a large corpus of text [6].
   OWL2Vec* [2] is an embedding method built on top of RDF2Vec [7] and
Word2Vec supporting multiple configurations and hyperparameters. OWL2Vec* has
been tested with medium (e.g. an events ontology) and small (e.g. pizza ontology)
size ontologies and has shown good performance; however, the performance is
expected to improve with larger vocabularies. The algorithm accepts an ontology as
input and produces embeddings as output. It first generates random walks over the
ontology to extract structural, lexical, and semantic information in order to create a
corpus of IRI and word sequences. Another option is to replace an entity in the
sequence by its Weisfeiler Lehman sub-tree kernel which measures the entity’s
neighborhood sub-graph in tree shape. This corpus is then fed to the word embedding
model to create IRI and word (token) vector representations, Viri and Vword.
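
To make the walk step concrete, the following is a minimal sketch of how a corpus of IRI sentences could be produced from random walks. It is not the OWL2Vec* implementation: the adjacency-list graph, the function names, and the node ex:TopClass are assumptions made for illustration, and walk depth is assumed to count the entities visited, which agrees with the depth-two example in Table 1. The Weisfeiler Lehman option and the lexical and semantic parts of the corpus are left out.

```python
import random

# Toy graph as adjacency lists: subject IRI -> [(predicate, object IRI), ...].
# The first edge is the one shown in Table 1; "ex:TopClass" is a made-up node.
GRAPH = {
    "obo:GO_0007611": [("rdfs:subClassOf", "obo:GO_0007610")],
    "obo:GO_0007610": [("rdfs:subClassOf", "ex:TopClass")],
    "ex:TopClass": [],
}


def random_walk(graph, start, depth, rng):
    """Return one walk as a flat sequence: entity, predicate, entity, ..."""
    walk = [start]
    current = start
    # Assumption: depth counts visited entities, so depth 2 follows one edge.
    for _ in range(depth - 1):
        edges = graph.get(current, [])
        if not edges:
            break  # dead end: stop the walk early
        predicate, current = rng.choice(edges)
        walk.extend([predicate, current])
    return walk


def build_corpus(graph, depth, walks_per_entity, seed=0):
    """Start the requested number of walks from every entity in the graph."""
    rng = random.Random(seed)
    corpus = []
    for entity in graph:
        for _ in range(walks_per_entity):
            corpus.append(random_walk(graph, entity, depth, rng))
    return corpus


corpus = build_corpus(GRAPH, depth=2, walks_per_entity=1)
for sentence in corpus:
    print(sentence)
```

Each walk becomes one sentence of the corpus; sentences of this kind are what the word embedding model is then trained on.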
   OWL2Vec* has four document settings, the first of which is the Structure
Document, Ds. It is composed of IRI sequences captured from the walks as well as the
axioms, or relationships, between the classes within the ontology. The second
document configuration is the Lexical Document, Ds, l , which replaces the IRIs of the
structural document with the entity labels. Additionally, it includes sentences of
lexical entity annotations.
   The last two document settings are both Combined Documents, whereby one
element of each sentence remains as an IRI and the other elements are converted to
lexical labels. The first strategy, Ds, l, rc , is based on random selection. A randomly
selected element will remain as an IRI, while the other elements are converted to their
lexical labels. The other strategy, Ds, l, tc , is to traverse each IRI in the sentence,
replacing other IRIs with their lexical labels. If there are n IRIs in the sentence, this
will create n new sentences. Table 1 shows examples of document sentences produced
from random walks of depth two starting at obo:GO_0007611 in Figure 1.
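
The relation between the four document settings can be sketched as a set of transformations applied to one structural sentence. This is an illustrative sketch under stated assumptions, not the OWL2Vec* code: the helper names are hypothetical, labels are assumed to be lower-cased and split on whitespace, and only class IRIs (not the predicate) are assumed to be candidates for staying as an IRI.

```python
import random

# Labels for the two classes used in the running example.
LABELS = {
    "obo:GO_0007611": "learning or memory",
    "obo:GO_0007610": "behavior",
}
# Assumed token for the predicate.
PREDICATE_TOKENS = {"rdfs:subClassOf": "subclassof"}


def tokens_for(item):
    """Lower-cased word tokens for a class label or a predicate."""
    if item in LABELS:
        return LABELS[item].lower().split()
    return [PREDICATE_TOKENS.get(item, item)]


def lexical_sentence(sentence):
    """Lexical document: every element is replaced by its word tokens."""
    return [tok for item in sentence for tok in tokens_for(item)]


def combined_sentence(sentence, keep_index):
    """Keep the element at keep_index as an IRI and lexicalize the rest."""
    out = []
    for i, item in enumerate(sentence):
        if i == keep_index:
            out.append(item)
        else:
            out.extend(tokens_for(item))
    return out


def entity_positions(sentence):
    # Assumption: only class IRIs (those with a label) may stay as IRIs.
    return [i for i, item in enumerate(sentence) if item in LABELS]


def combined_random(sentence, rng):
    """Random strategy: one randomly chosen class stays as an IRI."""
    return combined_sentence(sentence, rng.choice(entity_positions(sentence)))


def combined_traverse(sentence):
    """Traversal strategy: one new sentence per class IRI in the sentence."""
    return [combined_sentence(sentence, i) for i in entity_positions(sentence)]


structural = ["obo:GO_0007611", "rdfs:subClassOf", "obo:GO_0007610"]
print(lexical_sentence(structural))
print(combined_random(structural, random.Random(0)))
for sentence in combined_traverse(structural):
    print(sentence)
```

The random strategy always returns one of the sentences that the traversal strategy enumerates, so the traversal document is a superset of what any single random draw can produce.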




                            Fig. 1. A Snapshot of Gene Ontology.
Table 1. Sentence examples extracted from the Gene Ontology snapshot in Figure 1 with
random walk depth of two.
Document                          Sentence
Structural (Ds)                   (obo:GO_0007611, rdfs:subClassOf, obo:GO_0007610)
Literal (Ds, l )                  (“learning”, “or”, “memory”, “subclassof”, “behavior”)
Combined, random (Ds, l, rc )     (“learning”, “or”, “memory”, “subclassof”, obo:GO_0007610)
Combined, traverse (Ds, l, tc )   (“learning”, “or”, “memory”, “subclassof”, obo:GO_0007610),
                                  (obo:GO_0007611, “subclassof”, “behavior”)




3        Methods
Our input data is the Gene Ontology (GO) [4] retrieved on 9th October 2020 from the
Open Biological and Biomedical Ontologies (OBO) Foundry website. GO is one of
the most successful ontology development projects to date, broadly used in Life
Sciences. GO has 44,272 classes split into three branches: biological process (28,922
classes), molecular function (11,157 classes), cellular component (4,193 classes). We
use OWL2Vec* to create embeddings which are later clustered by k-means. We then
use the Silhouette score and Bag of Words to evaluate the output.

3.1       OWL2Vec* and Clustering
OWL2Vec* requires some hyperparameters and settings to run. First we have to
define the Document and Embedding Settings. OWL2Vec* has ten different
combinations of document and embedding settings. These combinations come from
the four document options (Structural, Lexical, Combined - Random, Combined -
Traversal) and three output vector options (IRI, Word, or a concatenation of IRI and
Word vectors); the Structural document contains only IRIs, so it yields IRI vectors
only. The annotations that are used as lexical information (i.e., Annotation
Type) can also be configured. This project tested whether limiting the type of entity
annotations to rdfs:label and oboInOwl:hasExactSynonym would reduce noise
compared with including all annotations. Therefore the settings tested are ‘All
Annotations’ and ‘Limited Annotations’. While the Walker Type determines the
walking method, the Walk Depth controls the length of the walk. Two ontology
walking methods were considered, random walks and random walks with Weisfeiler
Lehman sub-tree kernel enabled, and walking depths of 2, 4, and 6 were tested.
Finally, the Embedding Size is defined, having an impact on the output size. Sizes of
50, 100, and 200 were tested. Each of the resulting embeddings was clustered with
k-means twice: once with three clusters, to reflect the three GO branches, and once
with 400 clusters, to obtain small clusters that still carry meaningful information.
The former assesses the ability to capture the structure of the three ontology
branches and the latter the granularity of the produced clusters.
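
For readers unfamiliar with k-means, the clustering step can be sketched in a few lines of plain Python. This is a generic version of Lloyd's algorithm on made-up 2-D points, not the code used in this study: the deterministic farthest-first seeding, the iteration cap, and the toy data are assumptions chosen so that the example is reproducible.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding: repeatedly pick the point farthest from all seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up embeddings forming three well-separated groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
labels, centroids = kmeans(points, k=3)
print(labels)
```

On real embeddings with 50 to 200 dimensions and 400 clusters, a library implementation would be the practical choice; the sketch only shows the alternation between assignment and centroid update.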

3.2        Evaluation Metrics of Embedding Distances and Clusters
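Silhouette Score. The primary evaluation metric used for this project is the average
Silhouette score across all classes (see Eq. 1). It measures both cohesion and
separation of the clusters: for each item i, the mean intra-cluster distance, a(i), is
subtracted from the mean distance to the items of the next closest cluster, b(i), and
the difference is divided by the greater of the two. Here i and j correspond to two
items from the same or a different cluster, d(i, j) is the distance between their
embeddings, C(i) is the cluster containing i, and N is the number of total items.

    a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j), \qquad
    b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j)

    s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
    \bar{s} = \frac{1}{N} \sum_{i=1}^{N} s(i)                                         (1)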
Silhouette scores lie in the range −1 ≤ s(i) ≤ 1. A positive Silhouette score closer
to one is desirable and indicates better cohesion and separation. A negative score
indicates clusters are dispersed and lack clear inter-cluster separation. Due to the
nature of ontologies and their interconnectedness, a high degree of separation is
unlikely and may result in lower silhouette scores. In this study, average Silhouette
scores are used merely as a measure to compare embedding settings.
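
The computation described for Eq. 1 can be illustrated with a small, self-contained sketch. It assumes Euclidean distance and the common convention of scoring singleton clusters as 0, neither of which is stated in this paper, and it needs at least two clusters; the four one-dimensional points are made up.

```python
import math


def silhouette_scores(points, labels):
    """Per-item silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other items of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the items of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two tight, well-separated toy clusters on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
average = sum(scores) / len(scores)
print(round(average, 4))  # 0.8997
```

For the point at 0, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5 ≈ 0.905. Scores this high reflect the artificial separation of the toy data; as noted above, much lower values are expected for an interconnected ontology.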

Bag of Words. In order to assess the granularity achieved with the clusters, we look
at ‘semantic cohesion’, namely how lexically or semantically similar items within the
same cluster are. In this assessment, the lexical information of each entity’s label,
branch, parent, and synonyms are extracted from the annotations. The Bag of Words
then calculates the most frequent 1-gram and 2-gram sequence of tokens within the
cluster.
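
As an illustration of this step, the n-gram counting can be sketched with a counter over whitespace tokens. The tokenization, the function name, and the three labels are assumptions for the example (the labels are illustrative strings, not taken from a real cluster), and every occurrence is counted, which may differ from the counting used for Table 4.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count every n-gram of lower-cased whitespace tokens across the texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up labels standing in for the lexical information of one cluster.
cluster_labels = [
    "alanine transaminase activity",
    "glycine transaminase activity",
    "serine aminotransferase activity",
]

unigrams = ngram_counts(cluster_labels, 1)
bigrams = ngram_counts(cluster_labels, 2)
print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

The same function can be applied separately to labels, parents, and synonyms to obtain one ranking per annotation type.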


4      Results
Capturing Ontology Structure. To assess how well the embeddings capture the
structure of GO, distances between entity embeddings were calculated. The ontology
is naturally divided into three branches (biological process, molecular function, and
cellular component) and it is expected that classes within the same branch are closer
than classes in different branches. As such, we would expect classes within the same
branch to be cohesive and slightly separated from other branches, which should yield
a positive average Silhouette score.
   Table 2 shows the average Silhouette scores based on ontology branches. While the
best performing settings are those belonging to Ds + Viri with a walk depth of 2, there
are other interesting results from the table. Most notably, when lexical information
was included in the document, embeddings created from word vectors improved the
Silhouette score while IRI vectors hindered the Silhouette score. The table also shows
that the Combined Document settings performed worse than the Lexical Document
settings. This suggests that using both IRIs and lexical information in the document
created noise instead of improving correlation for the case of the GO. The table also
shows that limiting annotations did not have substantial effect on performance.
Table 2. Average Silhouette score based on ontology branches. The subscripts s, l, rc and tc
denote structure, literal, random-based combination, traversal-based combination documents,
respectively; the subscripts iri and word denote IRI and word vectors, respectively. Color scale
goes from dark green (highest score) to dark red (lowest score).




   The settings that produced the highest branch-based Silhouette score were with
Ds+Viri (structural document with IRI vector) using the Weisfeiler Lehman sub-tree
kernel with walking depth of two and producing embedding size of 100. As such, this
combination of OWL2Vec* settings was able to capture the structure of the three
ontology branches. Figure 2a shows that the embeddings have good intra-branch
cohesion and slight inter-branch separation.
   This set of embeddings was also used to inform a k-means clustering task. Ideally,
k-means would create clusters very similar to the ontology branches. Figures 2a and
2b compare the plot of embeddings based on the ontology’s branches and the clusters
assigned by k-means. Overall, k-means produced fairly even clusters (Cluster 1:
16,401 classes, Cluster 2: 15,118 classes, and Cluster 3: 12,753 classes) resulting in a
Silhouette score of 0.183. The composition of classes in each of the clusters is:
      Cluster 1: biological process 15,217; cellular component 1,137; molecular function 47.
      Cluster 2: biological process 13,691; cellular component 1,322; molecular function 105.
      Cluster 3: biological process 14; cellular component 1,734; molecular function 11,005.
   As is visible in the cluster composition and in Figure 2b, k-means does not
separate classes within the biological process and cellular component branches,
capturing the relationships that exist across branches. However, it was able to cluster
98.64% of molecular function classes into the same cluster.
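
The figures above can be cross-checked with a short calculation. The counts are copied from the cluster composition listed in this section; the variable and function names are only illustrative.

```python
# Class counts per k-means cluster and GO branch, as reported in the text.
composition = {
    "Cluster 1": {"biological process": 15217, "cellular component": 1137, "molecular function": 47},
    "Cluster 2": {"biological process": 13691, "cellular component": 1322, "molecular function": 105},
    "Cluster 3": {"biological process": 14, "cellular component": 1734, "molecular function": 11005},
}
branches = ["biological process", "cellular component", "molecular function"]

# Row sums give the cluster sizes, column sums give the branch sizes.
cluster_sizes = {name: sum(counts.values()) for name, counts in composition.items()}
branch_totals = {b: sum(counts[b] for counts in composition.values()) for b in branches}


def largest_share(branch):
    """Cluster holding most of a branch, and the fraction of the branch it holds."""
    name = max(composition, key=lambda n: composition[n][branch])
    return name, composition[name][branch] / branch_totals[branch]


print(cluster_sizes)
print(branch_totals)
for b in branches:
    name, share = largest_share(b)
    print(f"{b}: {share:.2%} in {name}")
```

The row sums reproduce the three cluster sizes and the column sums reproduce the branch sizes given in Section 3. Molecular function is concentrated in Cluster 3, whereas biological process is split almost evenly between Clusters 1 and 2.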




         (a) Ontology branches.                            (b) Three k-means clusters.
                        Fig. 2. Plot of embedding principal components
Cluster Granularity with K-Means. A k-means clustering was implemented for 400
clusters on each of the embeddings to see if it was possible to produce clusters of
around 110 classes each while achieving a fine level of granularity. It was assessed
that this number of clusters would likely yield meaningful clusters, neither too
narrowly defined nor overly broad.
   Silhouette score was used to identify the embedding settings that produced clusters
with good cohesion and separation. One of the interesting outcomes from Table 3 is
that increasing the walking depth improved performance of embeddings with IRI
vectors, Viri, and decreased performance of embeddings with word vectors, Vword. This
is the opposite of the effect observed in Table 2. One possible explanation is the
increased challenge of keeping compact and meaningful
clusters as the number of clusters increases. It is likely for lexical similarity to
decrease as walk depth increases, thus shallower walks may produce tighter clusters
when based on lexical information. In contrast, increasing the walk depth for
structural information may help the method to create neighborhoods around common
parent classes, improving granularity as walk depth increases. Further analysis is
needed to verify this hypothesis and will be carried out by testing intermediary cluster
sizes in conjunction with OntoClue requirements.

Table 3. Average Silhouette score based on 400 k-means clusters with various settings. The
subscripts and color scale have the same meaning as Table 2.




   Overall, embeddings created from word vectors produced a higher Silhouette
score. More specifically, the Lexical Document setting produced the ten highest
Silhouette scores. The highest Silhouette score was 0.1842 and was achieved with the
document and vector settings of Ds, l + Vword with limited annotations, using a random
walker of depth 2 and the embedding size of 50. There were on average 110 classes
per cluster with the median at 75 classes per cluster. This shows that the majority of
clusters are smaller than the target of 110 with some clusters being much larger.
   Looking into the makeup of clusters with Bag of Words, it is possible to see these
embedding settings capture neighborhoods well. For example, Cluster 66, highlighted
in Table 4, is composed of 108 classes, largely related to ‘transaminase activity’ and
‘aminotransferase activity’. Of the 108 classes within the cluster, 106 are descendants
of GO term GO_0008483 ‘transaminase activity’ within the molecular function
branch. Comparing this with the ontology, GO_0008483 has 112 descendants,
meaning this cluster was able to capture 94.6% of the entity’s descendants. It is
interesting to note that two items within the cluster are not descendants of
GO_0008483 yet have similar lexical information in their labels. GO_0005969 and
GO_0032144 have labels ‘serine-pyruvate aminotransferase complex’ and
‘4-aminobutyrate transaminase complex’. Both belong to the cellular component
branch but are capable of transaminase activity, highlighting the ability of the
algorithms to capture cross-branch relationships to a fine degree.
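
The descendant coverage reported above can be computed as in the following sketch. The tiny hierarchy, its ex: identifiers, and the function names are made up for illustration, and the descendant set is assumed to exclude the root term itself; the last lines simply redo the arithmetic with the counts reported for Cluster 66.

```python
def descendants(children, root):
    """All classes reachable below root via subclass links (root excluded)."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def coverage(cluster, children, root):
    """Fraction of root's descendants that fall inside the cluster."""
    desc = descendants(children, root)
    if not desc:
        return 0.0
    return len(desc & set(cluster)) / len(desc)


# Hypothetical hierarchy: parent -> direct subclasses (ex:T3 has two parents).
children = {
    "ex:T": ["ex:T1", "ex:T2"],
    "ex:T1": ["ex:T3"],
    "ex:T2": ["ex:T3", "ex:T4"],
}
cluster = ["ex:T1", "ex:T2", "ex:T3", "ex:X"]
toy_coverage = coverage(cluster, children, "ex:T")
print(toy_coverage)  # 0.75

# Counts reported for Cluster 66 and GO_0008483.
reported = 106 / 112
print(f"{reported:.1%}")  # 94.6%
```

Because classes may have several parents, the traversal keeps a set of visited nodes so that shared descendants are counted once.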
              Table 4. Cluster 66 Bag of Words - Top 5 1-grams and 2-grams with counts
     Label n-gram                  count       SubClassOf n-gram             count
     ‘activity’                    105         ‘activity’                    107
     ‘transaminase’                69          ‘transaminase’                90
     ‘transaminase activity’       66          ‘transaminase activity’       90
     ‘aminotransferase’            39          ‘aminotransferase’            16
     ‘aminotransferase activity’   37          ‘aminotransferase activity’   16


5       Conclusion and Future Work
In this paper, embeddings for the Gene Ontology were produced using various
settings of OWL2Vec*. When assessing the ability to capture the ontology structure,
the structure document with shallow walks and Weisfeiler Lehman sub-tree kernel
improved the Silhouette score. These settings showed good cohesion and separation
between the three ontology branches. K-means was also able to capture relationships
that exist across branches. Additionally, a k-means clustering with 400 clusters was
implemented to assess cluster granularity. Using the lexical document with word
vectors and shallow walks improved the Silhouette score. Bag of Words was used to
assess cluster makeup and lexical similarity. It would be interesting to improve upon
this method by naming clusters based on a common ancestor or by using the label of
the entity closest to the cluster centroid. In the near future we plan to apply such
clusters to assign topics (i.e., a set of ontology classes) to a corpus of documents to
improve the results of an IR task. An exploration of incremental cluster sizes may also
be useful to assess cluster makeup for the downstream IR task and gain further insight
into what the results show. In addition, OWL2Vec* embeddings can be applied in
diverse applications, for example, to predict entity relationships (as reported in [2]), or
in an ontology alignment task [8].

References
1.   Goyal, P., & Ferrara, E.: Graph embedding techniques, applications, and performance: A
     survey. Knowledge-Based Systems, 151, 78-94 (2018).
2.   Chen J, Hu P, Jimenez-Ruiz E, Holter OM, Antonyrajah D, Horrocks I. OWL2Vec*:
     Embedding of OWL Ontologies. Machine Learning, Springer, 2021.
     https://github.com/KRR-Oxford/OWL2Vec-Star
3.   Castro, LJ, Rebholz-Schuhmann D. 2021 OntoClue project. Retrieved on 14.03.2021.
     Available from: https://zbmed-semtec.github.io/ontoclue/
4.   The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing
     strong, Nucleic Acids Research, Volume 47, 2019
5.   Hogan A, Blomqvist E, Cochez M, d’Amato C, de Melo G, Gutierrez C, et al. Knowledge
     Graphs. arXiv:2003.02320. 2020. Available from: http://arxiv.org/abs/2003.02320
6.   Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in
     Vector Space. ICLR 2013
7.   Ristoski P, Rosati J, Di Noia T, De Leone R, Paulheim H: RDF2Vec: RDF graph
     embeddings and their applications. Semantic Web 10(4): 721-752 (2019). We rely on the
     pyRDF2Vec implementation: https://github.com/IBCNServices/pyRDF2Vec
8.   Chen, J., Jimenez-Ruiz, E., Horrocks, I., Antonyrajah, D., Hadian, A., Lee, J.:
     Augmenting Ontology Alignment by Semantic Embedding and Distant Supervision. In:
     European Semantic Web Conference, ESWC 2021