<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Ontology Alignment Metrics Using the Data Mining Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Semantic Web Laboratory</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Computer Engineering De-</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>partment, Sharif University of Technology</institution>
          ,
          <addr-line>Tehran</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Several metrics have been proposed for recognition of relationships between elements of two Ontologies. Many of these methods select a number of such metrics and combine them to extract existing mappings. In this article, we present a method for selection of more effective metrics based on data mining techniques. Furthermore, by having a set of metrics, we suggest a data-mining-like means for combining them into a better ontology alignment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Ontology Alignment is an essential tool in semantic web to
overcome heterogeneity of data, which is an integral attribute
of web. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Ontology Alignment is defined as a set of
correspondences between two or more ontologies. These
correspondences are expressed as mappings, in which Mapping is a
formal expression, that states the semantic relation between two
entities belonging to different ontologies. There have been
several proposals for drawing mappings in Ontology Alignment.
Many of them define some metrics to measure Similarity or
Distance of entities and find existing mappings using them
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To extract mappings, in most of these methods, couples
having Compound Similarity higher than a predefined
threshold – after applying a number of constraints – are selected as
output. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] contains a number of such methods.
      </p>
      <p>In this paper, given several similarity metrics we are trying
to determine which of them is best for a particular data set,
using data mining techniques. In order to do that, we train
our techniques on some mappings for which we have a gold
standard alignment, determining which metric is the best
predictor of the correct alignment. We consider such metrics to
be the best, and calculate Compound Similarity using them.</p>
      <p>The rest of this article is organized as follows. In section 2,
a review of related works in evaluation of existing methods
and calculation of compound similarity are given. Section 3
reports our proposed method. In section 4 an example of
applying this method is shown. Finally in section 5, discusses
on its advantages and disadvantages are explained.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Existing Works</title>
      <p>Works on metric evaluation as well as a method for
aggregating results of different metrics is introduced in this section.
2.1</p>
    </sec>
    <sec id="sec-3">
      <title>Alignment Evaluation Techniques</title>
      <p>
        Many of the algorithms and articles in Ontology Alignment
context uses Precision and Recall or their harmonic mean,
referred to as F-Measure, to evaluate the performance of a
method [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Also in some articles, they are used to evaluate
alignment metrics[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In such methods after aggregation of
results attained from different metrics, and extraction of
mappings – based on one of the methods mentioned in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] – the
resulting mappings are compared with actual results.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] a method for evaluation of Ontology Alignment
methods – Accuracy – is proposed. This quality metric is based
upon user effort needed to transform a match result obtained
automatically into the intended result.
1
Accuracy = Recall × (2 − P recision )
(1)
2.2
      </p>
    </sec>
    <sec id="sec-4">
      <title>Calculation of Compound Similarity</title>
      <p>
        The work closest to ours is probably that of Marc Ehrig et
al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In APFEL weights for each feature is calculated using
Decision Trees. The user only has to provide some ontologies
with known correct alignments. The learned decision tree is
used for aggregation and interpretation of the similarities.
3
      </p>
    </sec>
    <sec id="sec-5">
      <title>Proposed Method</title>
      <p>
        We first proposed a method to select appropriate metrics
among existing set, and then introduce a method to
combine them as a compound similarity. To use Precision, Recall,
F-measure and Accuracy for metrics evaluation, it is needed
to do mapping extraction. It depends on the definition of
a Threshold value and the approach for extracting as well
as on some defined constraints. Such dependencies results in
in-appropriateness of current evaluation methods, although
methods like what defines in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] used to compare quality of
metrics. We propose a new method for evaluation of metrics
and creating a compound metric from some of them,
featuring independent of mapping extraction phase, using learning.
Usually String and Linguistic based metrics are more
influential than others and therefore if we want to select some
metrics among existing metrics, most of the selected ones are
linguistic which results in lower performance and flexibility
of algorithm on different inputs. Therefor as a input for the
training set, a number of metrics with their associated
category is considered. Categories are for example String Metric,
Linguistic Metric, Structural Metric and so on. Proposed
algorithm selects one metric from each category. Furthermore,
to enforce the algorithm to use a specific metric we can define
a new category and introduce the metric as the only member
of it. Like other learning based methods, it needs an initial
training phase. In this phase a train set - an ontology pair
with actual mappings in them - is given to the algorithm.
3.1
      </p>
    </sec>
    <sec id="sec-6">
      <title>Learning Phase</title>
      <p>In our algorithm, selection of appropriate metrics and
aggregation of them are done based on Data Mining Techniques.
3.1.1</p>
      <sec id="sec-6-1">
        <title>Reduction to a Data Mining Problem</title>
        <p>For a pair of Ontologies a table is created with rows showing
comparison of an entity from first ontology to an entity from
the second one. For each metric under consideration a column
is created in such a table with values showing normalized
metric value for the pair of entities. An additional column
with true or false values shows the existence of actual mapping
between the two entities is also considered.</p>
        <p>One table is created for each pair of Ontologies in the
training set. Then all of such tables are aggregated in a single table.
In this table, the column representing actual mapping value
between a pair of entities is considered as target variable and
the rest of columns are predictors. The problem now is a
typical data mining problem and so we can apply classic data
mining techniques to solve it. Fig. 1 shows the process. In
this figure Real Results part shows the real mappings among
entities of ontologies which are required during learning phase,
and the Sensitivity Analysis Rectangle shows the results which
are gain after sensitivity analysis, showing the
appropriateness of metrics on the given train set.
3.1.2</p>
      </sec>
      <sec id="sec-6-2">
        <title>Selection of Appropriate Metrics</title>
        <p>
          In what following, we analysis the problem using Neural
Networks as well as CART 2 and C5.0 decision tress[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. As
mentioned before, columns of the table corresponding to values of
metrics are considered as Predictors and the actual mapping
value is the target variable. Fig. 1 shows the process. The aim
is to find metrics having most influence in prediction of the
target variable using Data Mining Models:
        </p>
        <p>Neural Networks: Sensitivity Analysis for any problem
is applied after a model has been constructed. With varying
the values of input variables in the acceptable interval, the
output variation is measured. With the interpretation of the
output variation it is possible to recognize most influential
input variable. After giving average value for each input
variable to the model and measuring the output of the model,
Sensitivity Analysis for each variable is done separately. To
do this, the values of all variables except one in consideration
are kept constant (their average value) and the model’s
response for minimum and maximum values of the variable in
consideration are calculated. This process is repeated for all
variables and then the variables with higher influence on
variance of output are selected as most influential variables. For
our problem it means that the metric having most variation
on output during analysis is the most important metric.</p>
        <p>Decision Trees: After creating the root node, in each
iteration of the algorithm, a node is added to the decision tree.
This process is repeated until the expansion of the tree is not
possible anymore considering some predefined constraints.
Selection of a variable as next node in the tree is done based on
information theory concepts – in each repetition a variable
with higher influence is selected among candidates. Therefore
as a node is more near to the root, its corresponding variable
has higher influence on the value of target variable. Hence
from the constructed decision tree, it is possible to say that
the metric in the root node has the highest influence.
3.1.3</p>
      </sec>
      <sec id="sec-6-3">
        <title>Calculation of the Compound Metric</title>
        <p>
          According to the results, and considering step 3-1, the
problem is reduced to a Data Mining problem with the goal of
finding an algorithm to compute target variable based on the
predictor variables. In the Data Mining area several solutions
have been proposed for these kind of problems. Among
existing Data Mining solutions, we can refer to CART and C5.0
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] decision trees, A Priori for Association Rules generation
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Neural Networks [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Based on initial results among
these methods, only Neural Networks has showed acceptable
results. Neural Networks, have similar behavior with popular
Alignment methods and they calculate Compound Similarity
in the form of Weighted Sum with the weights is adjusted
during learning.
        </p>
        <p>Similar to the evaluation method, a table is constructed.
As before, columns are the values selected metrics and an
additional column records the target variable (0 or 1) showing
the existence of a mapping between two entities. Now having
such training samples a Neural Network Model is built. It is
like a combined metric from the selected metrics which can
be used as a new metric for the extraction phase.
2 Classification And Regression Trees</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Using the Proposed Method</title>
      <p>
        To simplify the problem only String Based similarity metrics
are considered. For the initial set of metrics we consider
following metrics: the Levenshtein distance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] which used the
Edit Distance to match two strings, the Needleman-Wunsch
distance[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which assigns a different cost on the edit
operations, the Smith-Waterman [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which additionally uses an
alphabet mapping to costs, the Monge-Elkan [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which uses
variable costs depending on the substring gaps between the
words , the Stoilos similarity [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] which try to modify existing
approaches for entities of an ontology, Jaro-Winkler
similarity [
        <xref ref-type="bibr" rid="ref14 ref5">5, 14</xref>
        ], which counts the common characters between two
strings even if they are misplaced by a ”short” distance, and
the Sub-string distance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] which searches for the largest
common substring. EON2004 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] data set is used as the training
set which is explained below: Group 1xx: We only use test
103 from this group. Names of entities in this group is
remaining without any changing and cause this group not to
be a suitable data set for evaluation of string metrics. Group
2xx: The reference ontology is compared with a modified one.
Tests 204, 205, 221, 223 and 224 are used from this group.
Group 3xx: We use tests 302, 303 and 304 from this group.
The reference ontology is compared with real-life ontologies.
All: We merged all the data from described sets.
      </p>
      <p>Each comparison of two strings is assigned a similarity
degree. After collecting output for each metric, we evaluate them
for each data set as it is described in Sect. 2. Fig 2 shows the
results of applying Sensitivity Analysis on each test set after
normalization. Levenshtein similarity is the most important
one. Besides Sensitivity Analysis, Decision Tree models are
also used to confirm the results. In Table 1 we compare
results of these techniques. All of three tests agree about
importance of Levenshtein similarity on the test set. Neural
Network chooses Levenshtein while C5.0 and CART select it as
second suitable metric. According to the presented algorithm
and considering the fact that only one category is introduced
as input, only Levenshtein is selected. In a more real
situation the above steps are done for each category and one metric
from each category is selected. Levenshtein and Jaro-Winkler
are selected (from two imaginary categories). After creating a
neural network with 4 layers and evaluation of the model on
3xx test set, we got the convincing results.</p>
      <p>Most 2 important metrics
5</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>One advantage of the evaluation method is the uniform
treatment of Similarity and Distance metrics so that we don’t need
to differentiate and process them separately. This is because
in Data Mining evaluation, methods, there is no difference
between a variable and a linear form of it. The alignment
method can be improved when new metrics are introduced.
In such cases it is only needed to add some new columns and
do learning to adjust weights. Some of the researchers have
emphasized on clustering and application of metrics for
clusters as their future works. Another advantage of this method
is that we can add cluster value as a new column to influence
its importance for combination of metrics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          , and et al, '
          <article-title>Mining association rules between sets of items in large databases'</article-title>
          ,
          <source>in ACM SIGMOD Intl. Conf. Management of Data</source>
          , (May
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Bouquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Ehrig</surname>
          </string-name>
          , and et al, '
          <article-title>Specification of a common framework for characterizing alignment'</article-title>
          ,
          <source>deliverable 2.2</source>
          .1,
          <string-name>
            <surname>Knowledge</surname>
            <given-names>web NoE</given-names>
          </string-name>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Marc</given-names>
            <surname>Ehrig</surname>
          </string-name>
          , Staab Staab, and York Sure, '
          <article-title>Bootstrapping ontology alignment methods with APFEL'</article-title>
          ,
          <source>in Proceedings of the 4th International Semantic Web Conference (ISWC-</source>
          <year>2005</year>
          ), eds.,
          <string-name>
            <surname>Yolanda</surname>
            <given-names>Gil</given-names>
          </string-name>
          , Enrico Motta, and Richard Benjamins, Lecture Notes in Computer Science, pp.
          <fpage>186</fpage>
          -
          <lpage>200</lpage>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J</given-names>
            <surname>´erˆome</surname>
          </string-name>
          <string-name>
            <surname>Euzenat</surname>
          </string-name>
          , Thanh Le Bach, and et al, '
          <article-title>State of the art on ontology alignment'</article-title>
          ,
          <source>deliverable D2.2</source>
          .3,
          <string-name>
            <surname>Knowledge</surname>
            <given-names>web NoE</given-names>
          </string-name>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaro</surname>
          </string-name>
          , '
          <article-title>Probabilistic Linkage of Large Public Health Data Files'</article-title>
          ,
          <source>Molecular Biology</source>
          ,
          <volume>14</volume>
          ,
          <fpage>491</fpage>
          -
          <lpage>498</lpage>
          , (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Daniel</surname>
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Larose</surname>
          </string-name>
          ,
          <article-title>Discovering Knowledge In Data</article-title>
          , John Wiley and Sons, New Jersey, USA,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Levenshtein</surname>
          </string-name>
          , '
          <article-title>Binary Codes Capable of Correcting Deletions, Insertions</article-title>
          and Reversals',
          <source>Soviet Physics-Doklady</source>
          ,
          <volume>10</volume>
          ,
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          , (
          <year>August 1966</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          et al, '
          <article-title>A versatile graph matching algorithm'</article-title>
          ,
          <source>in Proceedings of ICDE</source>
          , (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Alvaro</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Monge</surname>
          </string-name>
          and Charles P. Elkan, '
          <article-title>The Field-Matching Problem: Algorithm and Applications'</article-title>
          ,
          <source>in Proceedings of the second international Conference on Knowledge Discovery and Data Mining</source>
          , (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.B.</given-names>
            <surname>Needleman</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Wunsch</surname>
          </string-name>
          , '
          <string-name>
            <given-names>A General</given-names>
            <surname>Method</surname>
          </string-name>
          <article-title>Applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins'</article-title>
          ,
          <source>Molecular Biology</source>
          ,
          <volume>48</volume>
          , (
          <year>1970</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.F.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.S.</given-names>
            <surname>Waterman</surname>
          </string-name>
          , 'Identification of Common Molecular Subsequences',
          <source>Molecular Biology</source>
          ,
          <volume>147</volume>
          , (
          <year>1981</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Giorgos</surname>
            <given-names>Stoilos</given-names>
          </string-name>
          , Giorgos Stamou, and et al, '
          <article-title>A String Metric for Ontology Alignment'</article-title>
          ,
          <source>in Proceedings of the ninth IEEE International Symposium on Wearable Computers</source>
          , pp.
          <fpage>624</fpage>
          -
          <lpage>237</lpage>
          , (
          <year>October 2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          , and et al, eds.
          <source>Proceedings of the 3rd Evaluation of Ontology-based tools</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>William</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Winkler</surname>
          </string-name>
          , '
          <article-title>The State Record Linkage</article-title>
          and Current Research Problems',
          <source>Technical report, U. S. Bureau of the Census, Statistical Research Division, Room 3000-4</source>
          , Bureau of the Census, Washington, DC,
          <fpage>20233</fpage>
          -
          <lpage>9100</lpage>
          USA, (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>