<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tradeoffs in Measuring Entity Similarity for Pattern Detection in OWL Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleni Mikroyannidi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Stevens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Iannone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Syntactic regularities are repetitive structures in the asserted axioms of an ontology represented as generalisations, which are axioms with variables. The Regularity Inspector for Ontologies (RIO) is a framework for detecting such regularities in ontologies. Established clustering techniques are applied to the signature of the ontology to detect clusters of similar entities. Clustering depends on pairwise entity distances, which determine the similarity of two entities. In this paper we present three variations on similarity definition that affect pairwise distances and thus the regularities detected. Our analysis explores and compares methods that capture regularities of different granularity; in particular we analyse commonalities and differences between the generalisations and clusters that result from the three variations of similarity and check if they capture dominant patterns in the ontology in the same way. We perform the analysis using the BioPortal corpus and we discuss the tradeoffs of each similarity function.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Ontologies are useful, but complex, logical artifacts whose development and reuse is
a difficult and time consuming task. Embedding knowledge patterns in ontologies is
an approach for facilitating these processes. It can help the systematic development of
the ontology and it is a technique for minimising discrepancies in the axioms. When a
discrepancy occurs in the axioms as a result of a faulty script, it can be addressed by
fixing and rerunning the initial script; otherwise, the ontology engineer has to manually
inspect and fix the faulty axioms by hand, ensuring each and every necessary axiom is
fixed. Thus, regularity in the development cycle of an ontology can help the organisation
and clarity of the composition of the axioms.</p>
      <p>
        In the development of big ontology projects like KUPKB1, FMA2, and so on, such
semi-automated approach for populating the ontology through design templates such
as scripts or spreadsheets is a common procedure [1,6]. However, ontology engineers
who are later reusing an ontology usually lack the ontology’s original documentation.
For example, in BioPortal the developers will publish only the ontology and perhaps a
link to the online documentation for the ontology, but this is not obligatory. It requires
effort from the user to understand how the domain is described in the axioms of the
ontology. Finding structural commonalities in the axioms is a key task as it can help
understanding how the ontology was constructed.
1 http://www.kupkb.org/
2 http://sig.biostr.washington.edu/projects/fm/AboutFM.html
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Van SubClassOf Vehicle
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) Bus SubClassOf Vehicle
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) Lorry SubClassOf Vehicle
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) Bicycle SubClassOf Vehicle
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) Driver EquivalentTo Person and (drives some Vehicle)
(
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) VanDriver SubClassOf Driver
(
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) LorryDriver SubClassOf Driver
(
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) BusDriver SubClassOf Driver
(
        <xref ref-type="bibr" rid="ref9">9</xref>
        ) VanDriver SubClassOf drives some Van
(
        <xref ref-type="bibr" rid="ref10">10</xref>
        ) LorryDriver SubClassOf drives some Lorry
(
        <xref ref-type="bibr" rid="ref11">11</xref>
        ) BusDriver SubClassOf drives some Bus
(12) PetOwner EquivalentTo Person and (hasPet some Pet)
(13) CatOwner SubClassOf PetOwner
(14) DogOwner SubClassOf PetOwner
(15) CatOwner SubClassOf hasPet some Cat
(16) DogOwner SubClassOf hasPet some Dog
(17) CatLiker EquivalentTo Person and (likes some Cat)
(18) DogLiker EquivalentTo Person and (likes some Dog)
where ?Vehicle =fCar, Bus, Lorry g, ?Driver =fVanDriver, LorryDriver, BusDriverg,
are meta-linguistic variable holding the similar entities . g1 and g2 are generalisations,
which are axioms with at least one variable. These generalisations are an axiom schema
as they are build on the basis of the asserted axioms of the ontology.
      </p>
      <p>To achieve the unsupervised detection of such regularities, RIO uses standard
clustering algorithms to detect clusters of entities with similar usage in the axioms of an
ontology. A key task in this procedure is the definition of similarity between entities;
the similarity measure should capture both structural and content similarities between
the axioms that describe similar entities.</p>
      <p>RIO can be used for obtaining an intuition of the construction of an ontology with
respect to its underlying patterns [8] and that can be used for more systematic quality
assurance [10]. In this paper we compare three variations of a replacement function that
capture the similarity between entities with respect to their usage in the asserted axioms
of an ontology.</p>
      <p>These replacement functions use heuristics for capturing the most important
commonalities between axioms and they are plugged in to the methods for the computation
of the similarity distance between entities. With the comparison we want to highlight
similarities and whether these similarities correspond to dominant patterns in the
ontology. In addition, such an analysis can be helpful for deciding which method or
combination of methods to select depending on the task for which regularities are used.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Related work on the detection of patterns in ontologies has mainly focused on the
supervised detection of regularities rather than unsupervised detection. In [7] a method
for matching axioms with ontology design patterns is described. However, this method
cannot detect patterns with the meaning of knowledge patterns [3]. In [4] the authors
describe TEIRESIAS, as an assistant in the task of building a large knowledge-based
system. The system used an inference engine and a knowledge base, the whole
procedure is guided by the user for the detection of structural similarities. In the same context,
authors in [5] define methods for mining patterns from ontologies with DL-safe rules
in a supervised way. Supervised detection of patterns through SPARQL queries is
described in [11].</p>
      <p>Unification is also a close area to pattern detection, with respect to the definition
of variables which can hold similar entities [2]. Unification was initially motivated for
finding redundancies in axioms. The algorithm for defining variables, called a
unifiability test, is non-deterministic and undecidable for logics higher than E L. RIO deals with
the problem of variable definition through clustering.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Syntactic Regularity Detection with RIO</title>
      <p>The main step we examine in this paper is the computation of entity pairwise
distances (step 1 of the workflow in Figure 2) and in particular the comparison of different
replacement functions, which are plugged in the distance similarity measure. The
preprocessing of data is an important step prior to clustering as it defines how similar data
are and eliminates noise and unwanted features.
The main challenge in detecting the semantic patterns is to decide which entities to
replace with variables (e.g. in ?Driver SubClassOf Driver, variable ?Driver holds all
drivers). To achieve that, we perform clustering in the signature of the entailments to
find groups of similar entities that will be represented with variables in the
generalisations.</p>
      <p>The similarity between two entities is measured with respect to their usage in the
axioms of an ontology. For example, different types of drivers are expected to be found
in the same cluster, since they are used in the same way in the axioms of an ontology.
For the definition of the similarity distance, we need first to introduce the replacement
function.</p>
      <p>
        Replacement function . Given an ontology O, and a set of axioms S for O, we
define =f ?class, ?objectProperty, ?dataProperty, ?star g a set of four symbols that
do not appear in the Sig(O). A placeholder replacement is a function : Sig(O) !
Sig(O) [ which, when applied to an entity e 2 Sig(S), returns: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) one of e, ?star
or ?class if e is a class name; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) one of e, ?star or ?objectProperty if e is an object
property name; (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) one of e, ?star or ?dataProperty if e is a data property name; (
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
one of e, ?star or ?individual if e is an individual name.
      </p>
      <p>In a nutshell, a replacement function decides whether or not to replace an entity e
with a symbol.</p>
      <p>Distance Given an ontology O, and the signature of O, We define the
distance between two entities, ( i, j ) 2 x as d( i; j ) jAi[AjAjij[AjAjij\Ajj , where
Ai; Aj = (Ax( i); Ax( j )), where Ax( i); Ax( j ) are the referencing axioms of
i; j respectively for which the replacement function is applied.</p>
      <p>Thus, the distance between two entities e1; e2 is computed as an overlap between
their referencing axioms that have been transformed into more abstract forms (Ai; Aj )
by the placeholder function . The defined distance is always in the interval [0,1], where
0 means that the two entities are identical and 1 that they have no similarity.</p>
      <p>is used to enable comparison between the referencing axioms of e1 and e2.
Changing the granularity of the place-holder replacement function produces more or less
sensitive distance functions (closer to value 1 or 0 respective). Different decision criteria
can be used for . These criteria perform different types of abstraction in the referencing
axioms of the entities; thus they capture different types of similarities.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Different types of replacement function</title>
      <p>Selecting a replacement function for computing distances is not a straightforward
task. This is due to the replacement function having to transform axioms in a way that
reflects their similar content. Different heuristics can be adopted for reflecting the most
important underlying semantics of the axioms when computing the distance.</p>
      <p>
        In this paper we compare tradeoffs where we delegate the decision of whether to
replace an entity in an axiom to a measure of three different criteria, producing three
different replacement functions. These criteria are (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) type of the entity, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) popularity
of the entity with respect to the other entities in the same kind of axiom within the
ontology and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) replacement of entities according to the structural similarities of their
axioms. Based on these criteria this section presents three different types of replacement
function:
      </p>
      <sec id="sec-4-1">
        <title>1. Property based replacement function</title>
        <p>2. Popularity based replacement function
3. Structural based replacement function</p>
        <p>For each type of replacement function we give a definition of what is replaced by a
general placeholder and an explanation for selecting a particular heuristics. An
example is also given from the ontology O in Figure 1 to demonstrate how the replacement
function works when applied to axioms. All these replacement functions are our
contribution.
4.1</p>
        <sec id="sec-4-1-1">
          <title>Property based replacement function</title>
          <p>Let O be an ontology and two entities i; j 2 sig(O) for which we want to compute
df i; j g. For every entity e 2 sig(O) the property based replacement function will
replace the following:
– ?star if e 2 f i; j g
– e if e is an object property, a data property or an annotation property name;
– S (e) otherwise;</p>
          <p>Example For computing the distance dfBusDriver; LorryDriverg,
the following in the referencing axioms of the entities:
will replace
ABusDriver = f?star SubClassOf ?owlclass,</p>
          <p>?star SubClassOf drives some ?owlclassg
ALorryDriver = f?star SubClassOf ?owlclass,</p>
          <p>?star SubClassOf drives some ?owlclassg
so dfBusDriver; LorryDriverg=0.</p>
          <p>This replacement function considers all properties as important in an axiom, thus
they are not replaced with a general place-holder as the other entities. The intuition
behind this is similar to the detection of isomorphic structures; two graphs are isomorphic
when there is an edge-preserving matching of their vertices. Thus, the important part
is the connections between the nodes. Considering that axioms can be represented in a
form of a graph, a similar approach is adopted for the replacement function.
4.2</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Popularity based replacement</title>
          <p>The popularity based replacement was introduced in [8]. The decision criterion for this
function is that popular entities are important, thus are retained by the replacement
function. When computing a distance between two entities, namely e1 and e2, for each
axiom where either occurs, the function replaces e1 or e2 with ?star and decides
whether to replace the other entities with a place-holder depending on their popularity
across all the axioms that have the same structure as .</p>
          <p>A confidence interval [l; u] for the mean value of popularity (95% confidence) is
used with unknown variance. Thus, for the computation of the area under the
distribution function (z), we use the values for the T distribution, rather than the normal one in
the formulas l = M z psNd ; u = M + z psNd .</p>
          <p>where with sd we denote the standard deviation and with M the mean computed on
the set of entities (whose size is N ) in the ontology. If the popularity of a given entity
is greater than u then the entity is not replaced by a placeholder, otherwise it is.</p>
          <p>Example Let us compute the replacements for calculating the distance d(VanDriver,
LorryDriver). We omit the calculations but the confidence interval for the popularity
when applied to the axioms is such that the only entities which will not be replaced are:
Driver and drives, because they are popular, thus:
A˙V anDriver = f?star SubClassOf Driver,</p>
          <p>?star SubClassOf drives some ?owlClassg
A˙LorryDriver = f?star SubClassOf Driver,</p>
          <p>?star SubClassOf drives some ?owlClassg</p>
          <p>The extensive usage of the object property drives in this particular kind of axiom
is the reason why our place-holder replacement function deems it as important and
preserves it in the replacement result. We observe, however, that deciding replacements
based on confidence intervals is strongly dependant on the quality of the sample data.
Driver, for instance, in the example above, is judged popular too. The reason is that all
drivers in the ontology are subclass of Driver. Conversely, the formula correctly spots
that several other entities (Bus, Dog, hasPet, . . . ) are not relevant when dealing with
axioms presenting a particular structure (e.g. ?owlClass SubClassOf
?owlObjectProperty some ?owlClass). We claim that this is preferable w.r.t. making an a priori
decision, maybe based on users’ intuitions, on what should be replaced and when.
4.3</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Axiom structure based replacement function</title>
          <p>The approach introduced in [10] is based on the search of a split of the entities in
corresponding placeholders regarding their popularity, position and structure of their
referencing axioms. We will demonstrate how this transformation policy works using
our example ontology of Figure 1.</p>
          <p>The transformation is done in two steps.</p>
          <p>
            Step 1: The representation of axioms in abstract forms; This is done by replacing
every entity in an axiom with a general variable denoting the type and the position of
the entity. The transformation result for the example ontology is
?class 2 SubClassOf ?class 1
?class 2 SubClassOf ?objectProperty 1 some ?class 1
?class 3 EquivalentTo ?class 2 and ?objectProperty 1 some ?class 1
(
            <xref ref-type="bibr" rid="ref1">1</xref>
            )
(
            <xref ref-type="bibr" rid="ref2">2</xref>
            )
(
            <xref ref-type="bibr" rid="ref3">3</xref>
            )
          </p>
          <p>
            Step 2: For each one of the general axioms (
            <xref ref-type="bibr" rid="ref1">1</xref>
            )-(
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) we retrieve their instantiations
from the ontology and check if the replacement of a variable with an entity gives a more
accurate separation of axioms in different groups.
          </p>
          <p>The choice of variable replacements depends on the structural commonalities of
the axioms. Our first criterion is that if there are more than two structural differences
between a pair of axioms then the variable should be checked for further replacements.
The idea behind this criterion is that we want to find a variable replacement in the
axioms that will reflect the differences between the entities in the ontology.</p>
          <p>
            The general axiom (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) abstracts the axioms (
            <xref ref-type="bibr" rid="ref1">1</xref>
            )-(
            <xref ref-type="bibr" rid="ref4">4</xref>
            ), (
            <xref ref-type="bibr" rid="ref6">6</xref>
            )-(
            <xref ref-type="bibr" rid="ref8">8</xref>
            ), (13) and (14) of
Figure 1. Many of these axioms have more than one structural difference such as axioms
(
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) and (
            <xref ref-type="bibr" rid="ref6">6</xref>
            ) or (
            <xref ref-type="bibr" rid="ref8">8</xref>
            ) and (13) etc.). Therefore, further possible replacements are examined.
Every placeholder replacement is represented in a tree. An example tree for the
generalisation (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) is shown in Figure 3. The general axiom is the root of the tree. Then, the
branches of the tree show all possible values for each variable of the general axiom. The
leaf nodes of the tree show the instantiations that result from the replacement of the
parent node. Replacements that abstract only a single axiom are discarded. Replacements
that separate the values of the other variables into different sets and abstract more than
one axiom are kept. For example, in Figure 3 all further splits of variable ?class 2 are
discarded as they abstract only a single axiom. However, the replacements for ?class 1
are kept as they abstract more than one axiom and separate the values of the two
variables (?class 1, class 2) into different sets. Therefore, classes Vehicle, Driver and
PetOwner in the axioms of the form of (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) are marked as “relevant” and they are not
replaced by a placeholder. The same procedure is followed for the general axioms (
            <xref ref-type="bibr" rid="ref2">2</xref>
            )
and (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ). The result is that entities PetOwner, hasPet, drives are not replaced. For
example, the distance d(VanDriver,LorryDriver) will be zero.
          </p>
          <p>?class_2 SubClassOf ?class_1
?class_2
?class_1
?class_2 =
Bicycle
?class_2 =</p>
          <p>
            Van
?claBsuss_2 = ?clLaossrr_y2 = ?VcalansDsr_iv2e=r L?ocrlarysDs_ri2ve=r ?BculassDsr_iv2e=r ?CcalatOssw_n2e=r
?class_2 =
DogOwner
?cVlaeshsic_l1e = ?clDarsisv_e1r = ?PcelatOssw_n1e=r
SuVBbeiCchylaiccslleesOf SuVbeCVhalaincslesOf SuVbeCBhulaicsslesOf SuVbLeCohlraircsylesOf SVuabDnCrDilvareisvrseOrf SLuobrDrCyriDlvaersirvseOrf SBuubDsCrDilvareisvrseOrf SCPueabttCOOlawwsnnseeOrrf SDPuoebgtCOOlwawsnnseeOrrf SuVbeCVhalaincslesOf SVuabDnCrDilvareisvrseOrf SCPueabttCOOlawwsnnseeOrrf
SuVbeCBhulaicsslesOf SLuobrDrCyriDlvaersirvseOrf SDPuoebgtCOOlawwsnnseeOrrf
SuVbLeCohlraircsylesOf SBuubDsCrDilvareisvrseOrf
Bicycle
SubClassOf
Vehicle
Step 3 in the workflow is the clustering of the entities based on their computed distances.
In RIO we use hierarchical agglomerative clustering (HAC) with stopping criterion the
maximal distance (d = 1) between all pairs for all clusters. Table 1 shows the clusters
that RIO returns for each replacement function. In the remainder of the paper, depending
on the replacement function we distinguish three variations of clustering named as:
(
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) popularity clustering, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) structural clustering and (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) property clustering. It should
be noted that the only parameter that varies is the type of replacement; the clustering
algorithm does not change.
          </p>
          <p>The final step of the RIO workflow is the formulation of generalisations with respect
to the detected clusters. Table 1 shows the generalisation that RIO returns for each
replacement function.</p>
          <p>Popularity
Vehicle: [Bicycle, Van, Bus, Lorry]
cluster 1: [hasPet, likes, drives]
Driver: [BusDriver, VanDriver, Lorry- Vehicle: [Bicycle, Van, Bus, Lorry]
Driver]
cluster 2: [PetOwner, Driver]
Cluster 3: [Pet, Vehicle]
PetOwner: [CatOwner, DogOwner]
cluster 4: [CatLiker, DogLiker]
cluster 5: [Dog, Cat]</p>
          <p>Structural Property
Clusters
cluster 1: [CatLiker, PetOwner, Dog- cluster 1: [Vehicle, Bus, Lorry]
Liker, Driver]
cluster 2: [Pet, Dog, Vehicle, Cat]
cluster 3: [hasPet, likes, drives] cluster 2: [CatLiker, DogLiker]
Driver: [BusDriver, VanDriver, Lorry- cluster 3: [Dog, Cat]
Driver]
PetOwner: [CatOwner, DogOwner]</p>
          <p>Driver: [BusDriver, VanDriver,
LorryDriver]</p>
          <p>PetOwner: [CatOwner, DogOwner]</p>
          <p>
            Generalisations / Instantiations
(g1) ?Driver SubClassOf drives (g1) ?cluster 1 EquivalentTo Person (g1) ?cluster 2 EquivalentTo Person
some ?Vehicle / (
            <xref ref-type="bibr" rid="ref9">9</xref>
            )-(
            <xref ref-type="bibr" rid="ref11">11</xref>
            ) and (?cluster 3 some ?cluster 2) / and (likes some ?cluster 3) / (17),
(
            <xref ref-type="bibr" rid="ref5">5</xref>
            ), (12), (17), (18) (18)
(g2) ?Driver SubClassOf Driver /(
            <xref ref-type="bibr" rid="ref6">6</xref>
            )- (g2) ?Vehicle SubClassOf Vehicle / (g2) ?PetOwner SubClassOf
(
            <xref ref-type="bibr" rid="ref8">8</xref>
            ) (
            <xref ref-type="bibr" rid="ref1">1</xref>
            )-(
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) PetOwner / (13), (14)
(g3) ?PetOwner SubClassOf (g3) ?Driver SubClassOf Driver / (
            <xref ref-type="bibr" rid="ref6">6</xref>
            )- (g3) ?PetOwner SubClassOf hasPet
PetOwner / (13),(14) (
            <xref ref-type="bibr" rid="ref8">8</xref>
            ) some ?cluster 3 / (15), (16)
(g4) ?Vehicle SubClassOf Vehicle / (g4) ?PetOwner SubClassOf hasPet (g4) ?cluster 1 SubClassOf Vehicle /
(
            <xref ref-type="bibr" rid="ref1">1</xref>
            )-(
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) some ?cluster 2 / (15), (16) (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ), (
            <xref ref-type="bibr" rid="ref3">3</xref>
            )
(g5) ?PetOwner SubClassOf hasPet (g5) ?PetOwner SubClassOf (g5) ?Driver SubClassOf Driver / (
            <xref ref-type="bibr" rid="ref6">6</xref>
            )
some ?cluster 5 / (15), (16) PetOwner / (13), (14) - (
            <xref ref-type="bibr" rid="ref8">8</xref>
            )
(g6) ?cluster 4 EquivalentTo Person (g6) ?Driver SubClassOf drives (g6) ?Driver SubClassOf drives
and (likes some ?cluster 5) / (17), some ?Vehicle / (
            <xref ref-type="bibr" rid="ref9">9</xref>
            )-(
            <xref ref-type="bibr" rid="ref11">11</xref>
            ) some ?cluster 1 (
            <xref ref-type="bibr" rid="ref10">10</xref>
            ), (
            <xref ref-type="bibr" rid="ref11">11</xref>
            )
(18)
(g7) ?cluster 2 EquivalentTo Person
and (?cluster 1 some ?cluster 4) /
(
            <xref ref-type="bibr" rid="ref5">5</xref>
            ), (12)
subsumer, which is not the owl:Thing entity (&gt;).
          </p>
          <p>As depicted in Table 1, we have similarities in the results between different
clustering variations. For example, the generalisation ?Driver SubClassOf Driver is common
generalisation between all three types of clustering. On the other hand, in some cases
we lose instantiations of a pattern; e.g. in property clustering we never get that Bicycle
SubClassOf Vehicle.
6</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>We want to analyse the differences in detected regularities that emerge when
replacement functions are switched. To do this, we used 208 ontologies from BioPortal3
downloaded in March 2012.</p>
      <p>From these, 170 ontologies were processed in approximately 2 hours for all
different types of clustering. The number of logical axioms for these ontologies varies from</p>
      <sec id="sec-5-1">
        <title>3 http://bioportal.bioontology.org/</title>
        <p>ten to ten thousand, with a mean value of 1038 axioms. The number of entities is
between fourteen and five thousands, with a mean value of 767 entities. The remaining 38
ontologies could not be processed withing the 2 hour time limit, thus were dismissed.</p>
        <p>Regularity results From the 170 processed BioPortal ontologies, 46 ontologies
(with 9 - 4907 logical axioms, 16 - 4965 entities) were found to have no detectable
regularity (no of clusters = 04) when the property replacement function was used. From
this set of 46 ontologies, three of them (having 326 - 759 entities, 266-768 axioms) had
zero clusters when the structural clustering was considered as well. On the other hand,
clustering with popularity replacement function detected regularities for all 170
ontologies. Detailed results of the detected regularities and statistical analysis are available
online5. Here we will give a summary of the results.</p>
        <p>Table 2 shows the total average of some regularity metrics for the 124 processed
ontologies. It should be noted that on Table 2, the Cluster Coverage refers to the
number of entities of a cluster that are covered by a single generalisation. For example, in
Table 1, popularity clustering, g1 has cluster coverage 75%, as g1 is not applicable for
Bicycle. The mean instantiations per generalisation metric shows how many axioms
(instantiations) instantiate a single regularity. E.g. in Table 1, g1 for popularity
clustering has 3 instantiations. The intuition behind these metrics is that generalisations with
many instantiations are better, because we can inspect in fewer generalisations to
understand the ontology’s composition. High cluster coverage by a singe generalisation is
also desirable; to let us understand how the entities in the cluster are described by a few
generalisations.</p>
        <p>We can observe from the results of Table 2, that selecting a different replacement
function will affect the final regularity results. First of all, popularity clustering will
find more axioms from an ontology to instantiate a regularity than the other two types
of clustering6. This also justifies the cases for which no regularity was detected when
the property or the structural replacement function was selected. Similarly, the
property clustering will return more clusters with fewer entities per cluster than the other
two methods. The increased number of clusters with popularity clustering has as a
consequence the increased number of generalisations (since a variable represents entities
from the corresponding cluster). In addition, we get fewer instantiations per
generalisation with the popularity clustering than the other two. That means that if one wants
4 In the AHC every cluster starts containing only one entity. In these cases none of the clusters
got merged, thus the number of clusters having more than one entity was equal to 0.
5 http://www.cs.man.ac.uk/ mikroyae/2013/owled
6 All the comparisons are done with respect to the other types of clustering.
to inspect an ontology to understand its construction, one has to look at more
generalisations and browse more clusters with popularity clustering than with structural
or property clustering. On the other hand, structural and property function might miss
some of the regularities from the ontology. Also the cluster coverage is lower for
structural and property clustering than for the popularity clustering. Thus, for understanding
the description of entities in a cluster, more generalisations need to be checked with the
structural and property function.</p>
        <p>
          We have formulated some initial hypothesis about how the regularity results are
affected when a different replacement function is selected. Some questions that arise then
are: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) do different clustering variations detect any common clusters? (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) if a dominant
pattern exists in the ontology, will this be detected with all three types of clustering?
(
          <xref ref-type="bibr" rid="ref3">3</xref>
          ) Do we get the same high ranked generalisations in terms of their instantiations
independent the replacement function?
        </p>
        <sec id="sec-5-1-1">
          <title>Cluster, generalisations and instantiation similarity To answer some of the pre</title>
          <p>
            vious questions we compare the clusters, generalisations and instantiations we obtain
from each variation of clustering. Therefore, we have three pairs of comparison: (
            <xref ref-type="bibr" rid="ref1">1</xref>
            )
popularity-structural, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) popularity-property and (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) structural-property. For each pair
we measure their cluster, generalisation and instantiation similarity.
          </p>
          <p>Figure 4 shows the summary for the cluster, generalisation and instantiation
similarity for each pair of comparison. The similarity metrics were computed for the 125
BioPortal ontologies for which a type of regularity was detected for all clustering
variations.</p>
          <p>The first graph (Figure 4(a)) shows that popularity and structural clustering will
return clusters with the highest similarity. Then follows the similarity between structural
and property clustering and finally the popularity and property clustering. The same
conclusion can be drawn for the rest of the similarity measures (Figure 4(b),(c)).
However, we observe that the high cluster similarity of the pair popularity-structural is not
followed by the corresponding generalisation similarity (median is 4% in Figure 4(b)).
This happens because, the clusters are initially sorted in descending order according to
their number of entities, and thus their labeling follows that ordering (e.g. Cluster 1,
Cluster 2 etc.). Entities which are found with popularity clustering in cluster 1 and
with structural clustering in Cluster 3 they will have some cluster similarity, but the
corresponding generalisation similarity is affected by the names of variables, thus it will
be low. However, this does not mean that we do not get the same structure of patterns
and this is clearly depicted in the instantiation similarity. For the popularity-structural
clustering the instantiation similarity is more than 70% with a median of 93%.
Similarly, for the other types of clustering, the instantiation similarity is more than 60% for
the majority of the processed ontologies. We can conclude that property clustering is the
one that will find the fewest axioms under a regularity as it is depicted in Figure 4(c).</p>
          <p>It needs further investigation when we want to present “dominant regularities” in
an ontology; depending on the replacement function we use, we might get different
high ranking generalisations with respect to the number of instantiations. For example,
a generalisation which is found in the 10 highest ranked generalisations of an ontology
with the structural clustering, it might be found lower in the ordering when a different
type of clustering is considered; because the instantiations were distributed in more than
q1 
min 
median 
max 
q3 
100%  
90%  
80%  
70%  
60%  
50%  
40%  
30%  
20%  
10%  
0%  
Popularity-­‐Structural   Popularity-­‐Property   Structural-­‐Property  </p>
          <p>Popularity-­‐Structural   Popularity-­‐Property   Structural-­‐Property  
(a)
Instan&gt;a&gt;on  Similarity  
(b)
Popularity-­‐Structural   Popularity-­‐Property   Structural-­‐Property  </p>
          <p>(c)
one generalisations when a different replacement function is selected. The verification
of this hypothesis remains as future work.
7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper we compared three different replacement functions, which are used for the
computation of similarity distance between entities in an ontology. This is a key task
when detecting syntactic patterns in an ontology with the RIO framework. The
similarity distance measure is computed on the basis of the asserted axioms that reference
the entities whose distance is computed. The replacement functions we use consider
either the popularity of an entity, the type of an entity or the structural similarities of
the axioms in order to decide which entities in an axiom should be replaced with an
abstract placeholder. This will enable the computation of distance as an overlap of the
transformed axioms. Different replacement functions will provide a different
granularity of abstraction over the axioms, leading to more or less sensitive distance measures.
Here we compared three such replacement functions with respect to the clusters and
generalisations (syntactic regularities) we obtain. We argue that such an analysis can be
helpful for deciding which method or combination of methods to select depending on
the task for which regularities are used.</p>
      <p>The analysis of this paper showed that almost the same portion of axioms will
be found to instantiate a regularity regardless of the replacement function we choose.
Tradeoffs are, however, made on the granularity of the generalisations that correspond
to dominant patterns. When property-based replacement function was selected, fewer
axioms from the ontology were found to instantiate a regularity with worst performance
of no detectable regularity for 43 ontologies (the other types of clustering returned at
least 4 clusters for the same ontologies). Thus, property clustering is not the best of
the three to choose when we want to mine as many patterns as possible. Considering
that popularity clustering detects more types of regularities with a good cluster
coverage we can say that it is a good solution for achieving inspection tasks. However, more
rigorous analysis is needed with users to verify such a result. This should include the
parameter of variables names as clusters with meaningful variable names can
significantly improve the readability of generalisations. The analysis presented in this paper
can be helpful as a preprocess of future user studies for deciding selection of
clustering variations for achieving tasks such as for inspecting an ontology or for performing
quality assurance of an ontology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Baader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lutz</surname>
          </string-name>
          .
          <article-title>Pushing the EL envelope</article-title>
          .
          <source>In Proceedings of the 19th international joint conference on Artificial intelligence, IJCAI'05</source>
          , pages
          <fpage>364</fpage>
          -
          <lpage>369</lpage>
          , San Francisco, CA, USA,
          <year>2005</year>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Baader</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Morawska</surname>
          </string-name>
          .
          <article-title>Unification in the description logic eln mathcal fELg</article-title>
          .
          <source>In Rewriting techniques and applications</source>
          , pages
          <fpage>350</fpage>
          -
          <lpage>364</lpage>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>Knowledge patterns</article-title>
          .
          <source>Knowledge Engineering: Practice and Patterns</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Interactive transfer of expertise: Acquisition of new inference rules</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>12</volume>
          (
          <issue>2</issue>
          ):
          <fpage>121</fpage>
          -
          <lpage>157</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. J. Jo´zefowska,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ławrynowicz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Łukaszewski</surname>
          </string-name>
          .
          <article-title>Towards discovery of frequent patterns in description logics with rules</article-title>
          .
          <source>Rules and Rule Markup Languages for the Semantic Web</source>
          , pages
          <fpage>84</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Jupp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schanstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          , et al.
          <article-title>Developing a kidney and urinary pathway knowledge base</article-title>
          .
          <source>Journal of biomedical semantics</source>
          ,
          <volume>2</volume>
          (
          <issue>Suppl 2</issue>
          ):
          <fpage>S7</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Khan</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          .
          <article-title>Ontology design pattern detection-initial method and usage scenarios</article-title>
          .
          <source>In SEMAPRO 2010, The Fourth International Conference on Advances in Semantic Processing</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>E.</given-names>
            <surname>Mikroyannidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iannone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Rector</surname>
          </string-name>
          .
          <article-title>Inspecting regularities in ontology design using clustering</article-title>
          .
          <source>The Semantic Web-ISWC</source>
          <year>2011</year>
          , pages
          <fpage>438</fpage>
          -
          <lpage>453</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>E.</given-names>
            <surname>Mikroyannidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A. A.</given-names>
            <surname>Manaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iannone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          .
          <article-title>Analysing syntactic regularities in ontologies</article-title>
          .
          <source>In OWL: Experiences and Directions Workshop</source>
          , OWLED,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. E. Mikroyannidi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iannone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rector</surname>
          </string-name>
          , et al.
          <article-title>Analysing Syntactic Regularities and Irregularities in SNOMED-CT</article-title>
          .
          <source>Journal of biomedical semantics</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>8</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>O.</surname>
          </string-name>
          <article-title>Sˇva´b-</article-title>
          <string-name>
            <surname>Zamazal</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Scharffe</surname>
            , and
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Sva</surname>
          </string-name>
          <article-title>´tek. Preliminary results of logical ontology pattern detection using sparql and lexical heuristics</article-title>
          .
          <source>In Proceedings of the Workshop on Ontology Patterns (WOP-2009). Citeseer</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>