Combining Ontology Alignment Metrics Using Data Mining Techniques

Babak Bagheri Hariri, Hassan Sayyadi, Hassan Abolhassani and Kyumars Sheykh Esmaili 1

Abstract. Several metrics have been proposed for recognizing relationships between the elements of two ontologies. Many alignment methods select a number of such metrics and combine them to extract the existing mappings. In this article, we present a method, based on data mining techniques, for selecting the more effective metrics. Furthermore, given a set of metrics, we suggest a data-mining-based means of combining them into a better ontology alignment.

1 Introduction

Ontology Alignment is an essential tool in the semantic web for overcoming the heterogeneity of data, which is an integral attribute of the web. In [2], Ontology Alignment is defined as a set of correspondences between two or more ontologies. These correspondences are expressed as mappings, where a mapping is a formal expression that states the semantic relation between two entities belonging to different ontologies. There have been several proposals for drawing mappings in Ontology Alignment. Many of them define metrics to measure the Similarity or Distance of entities and use these metrics to find the existing mappings [4]. To extract mappings, most of these methods select as output the couples whose Compound Similarity exceeds a predefined threshold, after applying a number of constraints; [4] surveys a number of such methods.

In this paper, given several similarity metrics, we try to determine which of them are best for a particular data set, using data mining techniques. To do so, we train our techniques on mappings for which we have a gold standard alignment, determining which metrics are the best predictors of the correct alignment. We consider those metrics the best, and calculate the Compound Similarity using them.

The rest of this article is organized as follows. Section 2 reviews related work on the evaluation of existing methods and the calculation of compound similarity. Section 3 describes our proposed method. Section 4 shows an example of applying the method. Finally, Section 5 discusses its advantages and disadvantages.

2 Existing Works

This section reviews work on metric evaluation as well as methods for aggregating the results of different metrics.

2.1 Alignment Evaluation Techniques

Many algorithms and articles in the Ontology Alignment context use Precision and Recall, or their harmonic mean, referred to as F-Measure, to evaluate the performance of a method [4]. In some articles they are also used to evaluate alignment metrics [12]. In such methods, after aggregating the results attained from the different metrics and extracting mappings (based on one of the methods mentioned in [4]), the resulting mappings are compared with the actual results.

In [8] a quality metric for evaluating Ontology Alignment methods, called Accuracy, is proposed. It is based on the user effort needed to transform an automatically obtained match result into the intended result:

    Accuracy = Recall × (2 − 1/Precision)    (1)

2.2 Calculation of Compound Similarity

The work closest to ours is probably that of Marc Ehrig et al. [3]. In APFEL, the weight for each feature is calculated using Decision Trees. The user only has to provide some ontologies with known correct alignments. The learned decision tree is then used for the aggregation and interpretation of the similarities.

3 Proposed Method

We first propose a method to select appropriate metrics from an existing set, and then introduce a method to combine them into a compound similarity. Using Precision, Recall, F-Measure and Accuracy for metric evaluation requires a mapping-extraction step, which depends on the definition of a threshold value and on the extraction approach, as well as on some defined constraints.
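As a concrete illustration of these evaluation measures, the sketch below (our own illustration, not code from the paper; the function name and the pair representation are our assumptions) computes Precision, Recall, F-Measure and the Accuracy of Eq. (1) for a candidate alignment compared against a reference alignment, both given as sets of entity pairs:

```python
def evaluate_alignment(found, reference):
    """Compare a found alignment against a reference alignment.

    Both arguments are sets of (entity1, entity2) pairs.  Returns
    (precision, recall, f_measure, accuracy), where accuracy is the
    quality metric of Eq. (1): Recall * (2 - 1/Precision).
    """
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Eq. (1) is undefined for Precision = 0; we return 0.0 in that case.
    accuracy = recall * (2 - 1 / precision) if precision else 0.0
    return precision, recall, f_measure, accuracy
```

For example, a found alignment {(a,x), (b,y), (c,z)} evaluated against the reference {(a,x), (b,y), (d,w)} yields Precision = Recall = F-Measure = 2/3 and Accuracy = 2/3 × (2 − 3/2) = 1/3. Note that the factor (2 − 1/Precision) becomes negative when Precision drops below 0.5, reflecting that repairing such a match result can cost more user effort than it saves.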
Such dependencies make the current evaluation methods inappropriate, even though methods like the one defined in [12] are used to compare the quality of metrics. We propose a new, learning-based method for evaluating metrics and for creating a compound metric from some of them; it is independent of the mapping extraction phase.

Usually String-based and Linguistic-based metrics are more influential than others, so if we simply select some metrics from the existing ones, most of the selected metrics will be linguistic, which lowers the performance and flexibility of the algorithm on different inputs. Therefore, the input to the training phase is a number of metrics together with their associated categories, for example String Metric, Linguistic Metric, Structural Metric and so on. The proposed algorithm selects one metric from each category. Furthermore, to force the algorithm to use a specific metric, we can define a new category and introduce that metric as its only member. Like other learning-based methods, ours needs an initial training phase, in which a training set (a pair of ontologies together with the actual mappings between them) is given to the algorithm.

3.1 Learning Phase

In our algorithm, the selection of appropriate metrics and their aggregation are both based on data mining techniques.

3.1.1 Reduction to a Data Mining Problem

For a pair of ontologies, a table is created in which each row represents the comparison of an entity from the first ontology with an entity from the second one. For each metric under consideration, a column holds the normalized metric value for the pair of entities. An additional column with true or false values records the existence of an actual mapping between the two entities.

One such table is created for each pair of ontologies in the training set, and all of these tables are aggregated into a single table. In this table, the column representing the actual mapping value between a pair of entities is the target variable and the remaining columns are predictors. The problem is now a typical data mining problem, so we can apply classic data mining techniques to solve it. Fig. 1 shows the process. In this figure, the Real Results part shows the real mappings among the entities of the ontologies, which are required during the learning phase, and the Sensitivity Analysis rectangle shows the results gained after sensitivity analysis, indicating the appropriateness of the metrics on the given training set.

3.1.2 Selection of Appropriate Metrics

In what follows, we analyze the problem using Neural Networks as well as the CART (Classification and Regression Trees) and C5.0 decision trees [6]. As mentioned before, the columns of the table corresponding to the values of the metrics are the predictors and the actual mapping value is the target variable. The aim is to find the metrics with the most influence on the prediction of the target variable, using the following data mining models:

Neural Networks: Sensitivity Analysis is applied after a model has been constructed. By varying the values of the input variables within their acceptable intervals, the variation of the output is measured; interpreting this variation makes it possible to recognize the most influential input variables. Sensitivity Analysis is done separately for each variable: the values of all variables except the one under consideration are kept constant at their average values, and the model's response for the minimum and maximum values of the variable under consideration is calculated. This process is repeated for all variables, and the variables with the highest influence on the variance of the output are selected as the most influential. For our problem, this means that the metric producing the most output variation during the analysis is the most important metric.

Decision Trees: After creating the root node, each iteration of the algorithm adds a node to the decision tree. This process is repeated until, given some predefined constraints, the tree cannot be expanded any further. The selection of a variable as the next node in the tree is based on information-theoretic concepts: in each iteration, the variable with the highest influence is selected among the candidates. Therefore, the nearer a node is to the root, the higher the influence of its corresponding variable on the value of the target variable. Hence, from the constructed decision tree it is possible to say that the metric in the root node has the highest influence.

3.1.3 Calculation of the Compound Metric

Given the reduction of Section 3.1.1, the problem becomes one of finding an algorithm that computes the target variable from the predictor variables. Several data mining solutions have been proposed for this kind of problem, among them the CART and C5.0 decision trees, A Priori for Association Rule generation [1], and Neural Networks [6]. In our initial results, only Neural Networks showed acceptable performance. Neural Networks behave similarly to popular alignment methods: they calculate the Compound Similarity as a weighted sum whose weights are adjusted during learning.

As in the evaluation method, a table is constructed. As before, the columns hold the values of the selected metrics, and an additional column records the target variable (0 or 1) indicating the existence of a mapping between the two entities. From these training samples a Neural Network model is built. It acts as a metric combined from the selected metrics, and can be used as a new metric in the extraction phase.

Figure 1. Proposed evaluation technique in detail

1 Semantic Web Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran, email: {hariri,sayyadi}@ce.sharif.edu, abolhassani@sharif.edu, shesmail@ce.sharif.edu
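The one-at-a-time Sensitivity Analysis of Section 3.1.2 can be sketched as follows. This is our own minimal illustration, not code from the paper: we assume the trained model is available as a function mapping a vector of normalized metric values to a similarity score, and all names are hypothetical.

```python
import numpy as np

def metric_importance(model, X):
    """One-at-a-time sensitivity analysis.

    X is the training table of Section 3.1.1 (rows = entity pairs,
    columns = normalized metric values).  Hold every column at its
    average value, sweep one column from its minimum to its maximum,
    and record how much the model's output swings.  Larger swings
    mean more influential metrics; results are normalized to sum to 1.
    """
    means, mins, maxs = X.mean(axis=0), X.min(axis=0), X.max(axis=0)
    swings = []
    for j in range(X.shape[1]):
        lo, hi = means.copy(), means.copy()
        lo[j], hi[j] = mins[j], maxs[j]          # vary only column j
        swings.append(abs(model(hi) - model(lo)))
    total = sum(swings)
    return [s / total for s in swings] if total else swings
```

For a toy linear "network" such as model = lambda v: 0.8 * v[0] + 0.2 * v[1], with both metric columns spanning [0, 1], the analysis attributes 80% of the importance to the first metric, matching the intuition that the metric moving the output the most is the most important one.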
4 Using the Proposed Method

To simplify the problem, only string-based similarity metrics are considered. For the initial set of metrics we consider: the Levenshtein distance [7], which uses the edit distance to match two strings; the Needleman-Wunsch distance [10], which assigns different costs to the edit operations; the Smith-Waterman distance [11], which additionally uses an alphabet mapping to costs; the Monge-Elkan distance [9], which uses variable costs depending on the substring gaps between the words; the Stoilos similarity [12], which tries to adapt existing approaches to the entities of an ontology; the Jaro-Winkler similarity [5, 14], which counts the common characters between two strings even if they are misplaced by a "short" distance; and the Sub-string distance [4], which searches for the largest common substring.

The EON2004 [13] data set is used as the training set:

Group 1xx: We only use test 103 from this group. The names of the entities in this group are left unchanged, which makes the group unsuitable for evaluating string metrics.
Group 2xx: The reference ontology is compared with a modified one. Tests 204, 205, 221, 223 and 224 are used from this group.
Group 3xx: The reference ontology is compared with real-life ontologies. We use tests 302, 303 and 304 from this group.
All: We merged all the data from the described sets.

Each comparison of two strings is assigned a similarity degree. After collecting the output of each metric, we evaluate the metrics on each data set as described in Sect. 2. Fig. 2 shows the results of applying Sensitivity Analysis on each test set after normalization; the Levenshtein similarity is the most important one. Besides Sensitivity Analysis, Decision Tree models are also used to confirm the results. Table 1 compares the results of these techniques. All three techniques agree on the importance of the Levenshtein similarity on the test set: the Neural Network ranks Levenshtein first, while C5.0 and CART select it as the second most suitable metric.

         Neural Network   CART           C5.0
  1st    Levenshtein      Jaro-Winkler   Needleman-Wunsch
  2nd    SubString        Levenshtein    Levenshtein

Table 1. The two most important metrics

Figure 2. Evaluation of string metrics using Neural Networks

According to the presented algorithm, and considering that only one category is introduced as input, only Levenshtein is selected. In a more realistic situation, the above steps are done for each category and one metric is selected from each. Suppose Levenshtein and Jaro-Winkler are selected (from two imaginary categories): after creating a neural network with 4 layers and evaluating the model on the 3xx test set, we obtained convincing results.

5 Conclusions

One advantage of the evaluation method is its uniform treatment of Similarity and Distance metrics, so we do not need to differentiate them and process them separately. This is because in data mining evaluation methods there is no difference between a variable and a linear transformation of it. The alignment method can also be improved when new metrics are introduced: it is only necessary to add some new columns and redo the learning to adjust the weights. Some researchers have emphasized clustering and the application of metrics per cluster as future work. Another advantage of our method is that we can add the cluster value as a new column, letting the learning determine its importance in the combination of metrics.

REFERENCES

[1] R. Agrawal, T. Imielinski, et al., 'Mining association rules between sets of items in large databases', in ACM SIGMOD Intl. Conf. Management of Data, (May 1993).
[2] Paolo Bouquet, Marc Ehrig, et al., 'Specification of a common framework for characterizing alignment', deliverable 2.2.1, Knowledge Web NoE, (2004).
[3] Marc Ehrig, Steffen Staab, and York Sure, 'Bootstrapping ontology alignment methods with APFEL', in Proceedings of the 4th International Semantic Web Conference (ISWC-2005), eds., Yolanda Gil, Enrico Motta, and Richard Benjamins, Lecture Notes in Computer Science, pp. 186–200, (2005).
[4] Jérôme Euzenat, Thanh Le Bach, et al., 'State of the art on ontology alignment', deliverable D2.2.3, Knowledge Web NoE, (2004).
[5] M. Jaro, 'Probabilistic linkage of large public health data files', Statistics in Medicine, 14, 491–498, (1995).
[6] Daniel T. Larose, Discovering Knowledge in Data, John Wiley and Sons, New Jersey, USA, 2005.
[7] Vladimir Levenshtein, 'Binary codes capable of correcting deletions, insertions and reversals', Soviet Physics-Doklady, 10, 707–710, (August 1966).
[8] S. Melnik, H. Garcia-Molina, et al., 'Similarity flooding: a versatile graph matching algorithm and its application to schema matching', in Proceedings of ICDE, (2002).
[9] Alvaro E. Monge and Charles P. Elkan, 'The field-matching problem: algorithm and applications', in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, (1996).
[10] S.B. Needleman and C.D. Wunsch, 'A general method applicable to the search for similarities in the amino acid sequence of two proteins', Journal of Molecular Biology, 48, (1970).
[11] T.F. Smith and M.S. Waterman, 'Identification of common molecular subsequences', Journal of Molecular Biology, 147, (1981).
[12] Giorgos Stoilos, Giorgos Stamou, et al., 'A string metric for ontology alignment', in Proceedings of the 4th International Semantic Web Conference (ISWC-2005), pp. 624–637, (2005).
[13] Y. Sure, O. Corcho, et al., eds., Proceedings of the 3rd Workshop on Evaluation of Ontology-based Tools (EON2004), 2004.
[14] William E. Winkler, 'The state of record linkage and current research problems', Technical report, U.S. Bureau of the Census, Statistical Research Division, Washington, DC, USA, (1999).