Combining Ontology Alignment Metrics Using Data Mining Techniques

Babak Bagheri Hariri, Hassan Sayyadi, Hassan Abolhassani and Kyumars Sheykh Esmaili 1

Abstract. Several metrics have been proposed for recognizing relationships between the elements of two ontologies. Many alignment methods select a number of such metrics and combine them to extract the existing mappings. In this article, we present a method, based on data mining techniques, for selecting the more effective metrics. Furthermore, given a set of metrics, we suggest a data-mining-based means of combining them into a better ontology alignment.

1 Introduction

Ontology Alignment is an essential tool in the semantic web for overcoming the heterogeneity of data, which is an integral attribute of the web. In [2], Ontology Alignment is defined as a set of correspondences between two or more ontologies. These correspondences are expressed as mappings, where a mapping is a formal expression that states the semantic relation between two entities belonging to different ontologies. There have been several proposals for drawing mappings in Ontology Alignment. Many of them define metrics to measure the Similarity or Distance of entities and use these metrics to find the existing mappings [4]. To extract mappings, most of these methods select as output the couples whose Compound Similarity exceeds a predefined threshold, after applying a number of constraints; [4] surveys a number of such methods.

In this paper, given several similarity metrics, we try to determine which of them are best for a particular data set, using data mining techniques. To do so, we train our techniques on mappings for which we have a gold standard alignment, determining which metrics are the best predictors of the correct alignment. We consider those metrics the best, and calculate the Compound Similarity using them.

The rest of this article is organized as follows. Section 2 reviews related work on the evaluation of existing methods and the calculation of compound similarity. Section 3 describes our proposed method. Section 4 shows an example of applying the method. Finally, Section 5 discusses its advantages and disadvantages.

2 Existing Works

This section reviews work on metric evaluation as well as methods for aggregating the results of different metrics.

2.1 Alignment Evaluation Techniques

Many algorithms and articles in the Ontology Alignment context use Precision and Recall, or their harmonic mean, referred to as F-Measure, to evaluate the performance of a method [4]. In some articles they are also used to evaluate alignment metrics [12]. In such methods, after aggregating the results attained from the different metrics and extracting mappings (based on one of the methods mentioned in [4]), the resulting mappings are compared with the actual results.

In [8] a quality metric for evaluating Ontology Alignment methods, called Accuracy, is proposed. It is based on the user effort needed to transform an automatically obtained match result into the intended result:

    Accuracy = Recall × (2 − 1/Precision)    (1)

2.2 Calculation of Compound Similarity

The work closest to ours is probably that of Marc Ehrig et al. [3]. In APFEL, the weight for each feature is calculated using Decision Trees. The user only has to provide some ontologies with known correct alignments. The learned decision tree is then used for the aggregation and interpretation of the similarities.

3 Proposed Method

We first propose a method to select appropriate metrics from an existing set, and then introduce a method to combine them into a compound similarity. Using Precision, Recall, F-Measure and Accuracy for metric evaluation requires a mapping-extraction step, which depends on the definition of a threshold value and on the extraction approach, as well as on some defined constraints.
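As a concrete illustration of these evaluation measures, the sketch below (our own illustration, not code from the paper; the function name and the pair representation are our assumptions) computes Precision, Recall, F-Measure and the Accuracy of Eq. (1) for a candidate alignment compared against a reference alignment, both given as sets of entity pairs:

```python
def evaluate_alignment(found, reference):
    """Compare a found alignment against a reference alignment.

    Both arguments are sets of (entity1, entity2) pairs.  Returns
    (precision, recall, f_measure, accuracy), where accuracy is the
    quality metric of Eq. (1): Recall * (2 - 1/Precision).
    """
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Eq. (1) is undefined for Precision = 0; we return 0.0 in that case.
    accuracy = recall * (2 - 1 / precision) if precision else 0.0
    return precision, recall, f_measure, accuracy
```

For example, a found alignment {(a,x), (b,y), (c,z)} evaluated against the reference {(a,x), (b,y), (d,w)} yields Precision = Recall = F-Measure = 2/3 and Accuracy = 2/3 × (2 − 3/2) = 1/3. Note that the factor (2 − 1/Precision) becomes negative when Precision drops below 0.5, reflecting that repairing such a match result can cost more user effort than it saves.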
Such dependencies make the current evaluation methods inappropriate, even though methods like the one defined in [12] are used to compare the quality of metrics. We propose a new, learning-based method for evaluating metrics and for creating a compound metric from some of them; it is independent of the mapping extraction phase.

Usually String-based and Linguistic-based metrics are more influential than others, so if we simply select some metrics from the existing ones, most of the selected metrics will be linguistic, which lowers the performance and flexibility of the algorithm on different inputs. Therefore, the input to the training phase is a number of metrics together with their associated categories, for example String Metric, Linguistic Metric, Structural Metric and so on. The proposed algorithm selects one metric from each category. Furthermore, to force the algorithm to use a specific metric, we can define a new category and introduce that metric as its only member. Like other learning-based methods, ours needs an initial training phase, in which a training set (a pair of ontologies together with the actual mappings between them) is given to the algorithm.

3.1 Learning Phase

In our algorithm, the selection of appropriate metrics and their aggregation are both based on data mining techniques.

3.1.1 Reduction to a Data Mining Problem

For a pair of ontologies, a table is created in which each row represents the comparison of an entity from the first ontology with an entity from the second one. For each metric under consideration, a column holds the normalized metric value for the pair of entities. An additional column with true or false values records the existence of an actual mapping between the two entities.

One such table is created for each pair of ontologies in the training set, and all of these tables are aggregated into a single table. In this table, the column representing the actual mapping value between a pair of entities is the target variable and the remaining columns are predictors. The problem is now a typical data mining problem, so we can apply classic data mining techniques to solve it. Fig. 1 shows the process. In this figure, the Real Results part shows the real mappings among the entities of the ontologies, which are required during the learning phase, and the Sensitivity Analysis rectangle shows the results gained after sensitivity analysis, indicating the appropriateness of the metrics on the given training set.

3.1.2 Selection of Appropriate Metrics

In what follows, we analyze the problem using Neural Networks as well as the CART (Classification and Regression Trees) and C5.0 decision trees [6]. As mentioned before, the columns of the table corresponding to the values of the metrics are the predictors and the actual mapping value is the target variable. The aim is to find the metrics with the most influence on the prediction of the target variable, using the following data mining models:

Neural Networks: Sensitivity Analysis is applied after a model has been constructed. By varying the values of the input variables within their acceptable intervals, the variation of the output is measured; interpreting this variation makes it possible to recognize the most influential input variables. Sensitivity Analysis is done separately for each variable: the values of all variables except the one under consideration are kept constant at their average values, and the model's response for the minimum and maximum values of the variable under consideration is calculated. This process is repeated for all variables, and the variables with the highest influence on the variance of the output are selected as the most influential. For our problem, this means that the metric producing the most output variation during the analysis is the most important metric.

Decision Trees: After creating the root node, each iteration of the algorithm adds a node to the decision tree. This process is repeated until, given some predefined constraints, the tree cannot be expanded any further. The selection of a variable as the next node in the tree is based on information-theoretic concepts: in each iteration, the variable with the highest influence is selected among the candidates. Therefore, the nearer a node is to the root, the higher the influence of its corresponding variable on the value of the target variable. Hence, from the constructed decision tree it is possible to say that the metric in the root node has the highest influence.

3.1.3 Calculation of the Compound Metric

Given the reduction of Section 3.1.1, the problem becomes one of finding an algorithm that computes the target variable from the predictor variables. Several data mining solutions have been proposed for this kind of problem, among them the CART and C5.0 decision trees, A Priori for Association Rule generation [1], and Neural Networks [6]. In our initial results, only Neural Networks showed acceptable performance. Neural Networks behave similarly to popular alignment methods: they calculate the Compound Similarity as a weighted sum whose weights are adjusted during learning.

As in the evaluation method, a table is constructed. As before, the columns hold the values of the selected metrics, and an additional column records the target variable (0 or 1) indicating the existence of a mapping between the two entities. From these training samples a Neural Network model is built. It acts as a metric combined from the selected metrics, and can be used as a new metric in the extraction phase.

Figure 1. Proposed evaluation technique in detail

1 Semantic Web Laboratory, Computer Engineering Department, Sharif University of Technology, Tehran, Iran, email: {hariri,sayyadi}@ce.sharif.edu, abolhassani@sharif.edu, shesmail@ce.sharif.edu
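The one-at-a-time Sensitivity Analysis of Section 3.1.2 can be sketched as follows. This is our own minimal illustration, not code from the paper: we assume the trained model is available as a function mapping a vector of normalized metric values to a similarity score, and all names are hypothetical.

```python
import numpy as np

def metric_importance(model, X):
    """One-at-a-time sensitivity analysis.

    X is the training table of Section 3.1.1 (rows = entity pairs,
    columns = normalized metric values).  Hold every column at its
    average value, sweep one column from its minimum to its maximum,
    and record how much the model's output swings.  Larger swings
    mean more influential metrics; results are normalized to sum to 1.
    """
    means, mins, maxs = X.mean(axis=0), X.min(axis=0), X.max(axis=0)
    swings = []
    for j in range(X.shape[1]):
        lo, hi = means.copy(), means.copy()
        lo[j], hi[j] = mins[j], maxs[j]          # vary only column j
        swings.append(abs(model(hi) - model(lo)))
    total = sum(swings)
    return [s / total for s in swings] if total else swings
```

For a toy linear "network" such as model = lambda v: 0.8 * v[0] + 0.2 * v[1], with both metric columns spanning [0, 1], the analysis attributes 80% of the importance to the first metric, matching the intuition that the metric moving the output the most is the most important one.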
4 Using the Proposed Method

To simplify the problem, only string-based similarity metrics are considered. For the initial set of metrics we consider: the Levenshtein distance [7], which uses the edit distance to match two strings; the Needleman-Wunsch distance [10], which assigns different costs to the edit operations; the Smith-Waterman distance [11], which additionally uses an alphabet mapping to costs; the Monge-Elkan distance [9], which uses variable costs depending on the substring gaps between the words; the Stoilos similarity [12], which tries to adapt existing approaches to the entities of an ontology; the Jaro-Winkler similarity [5, 14], which counts the common characters between two strings even if they are misplaced by a "short" distance; and the Sub-string distance [4], which searches for the largest common substring.

The EON2004 [13] data set is used as the training set:

Group 1xx: We only use test 103 from this group. The names of the entities in this group are left unchanged, which makes the group unsuitable for evaluating string metrics.
Group 2xx: The reference ontology is compared with a modified one. Tests 204, 205, 221, 223 and 224 are used from this group.
Group 3xx: The reference ontology is compared with real-life ontologies. We use tests 302, 303 and 304 from this group.
All: We merged all the data from the described sets.

Each comparison of two strings is assigned a similarity degree. After collecting the output of each metric, we evaluate the metrics on each data set as described in Sect. 2. Fig. 2 shows the results of applying Sensitivity Analysis on each test set after normalization; the Levenshtein similarity is the most important one. Besides Sensitivity Analysis, Decision Tree models are also used to confirm the results. Table 1 compares the results of these techniques. All three techniques agree on the importance of the Levenshtein similarity on the test set: the Neural Network ranks Levenshtein first, while C5.0 and CART select it as the second most suitable metric.

         Neural Network   CART           C5.0
  1st    Levenshtein      Jaro-Winkler   Needleman-Wunsch
  2nd    SubString        Levenshtein    Levenshtein

Table 1. The two most important metrics

Figure 2. Evaluation of string metrics using Neural Networks

According to the presented algorithm, and considering that only one category is introduced as input, only Levenshtein is selected. In a more realistic situation, the above steps are done for each category and one metric is selected from each. Suppose Levenshtein and Jaro-Winkler are selected (from two imaginary categories): after creating a neural network with 4 layers and evaluating the model on the 3xx test set, we obtained convincing results.

5 Conclusions

One advantage of the evaluation method is its uniform treatment of Similarity and Distance metrics, so we do not need to differentiate them and process them separately. This is because in data mining evaluation methods there is no difference between a variable and a linear transformation of it. The alignment method can also be improved when new metrics are introduced: it is only necessary to add some new columns and redo the learning to adjust the weights. Some researchers have emphasized clustering and the application of metrics per cluster as future work. Another advantage of our method is that we can add the cluster value as a new column, letting the learning determine its importance in the combination of metrics.

REFERENCES

[1] R. Agrawal, T. Imielinski, et al., 'Mining association rules between sets of items in large databases', in ACM SIGMOD Intl. Conf. Management of Data, (May 1993).
[2] Paolo Bouquet, Marc Ehrig, et al., 'Specification of a common framework for characterizing alignment', deliverable 2.2.1, Knowledge Web NoE, (2004).
[3] Marc Ehrig, Steffen Staab, and York Sure, 'Bootstrapping ontology alignment methods with APFEL', in Proceedings of the 4th International Semantic Web Conference (ISWC-2005), eds., Yolanda Gil, Enrico Motta, and Richard Benjamins, Lecture Notes in Computer Science, pp. 186–200, (2005).
[4] Jérôme Euzenat, Thanh Le Bach, et al., 'State of the art on ontology alignment', deliverable D2.2.3, Knowledge Web NoE, (2004).
[5] M. Jaro, 'Probabilistic linkage of large public health data files', Statistics in Medicine, 14, 491–498, (1995).
[6] Daniel T. Larose, Discovering Knowledge in Data, John Wiley and Sons, New Jersey, USA, 2005.
[7] Vladimir Levenshtein, 'Binary codes capable of correcting deletions, insertions and reversals', Soviet Physics-Doklady, 10, 707–710, (August 1966).
[8] S. Melnik, H. Garcia-Molina, et al., 'Similarity flooding: a versatile graph matching algorithm and its application to schema matching', in Proceedings of ICDE, (2002).
[9] Alvaro E. Monge and Charles P. Elkan, 'The field-matching problem: algorithm and applications', in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, (1996).
[10] S.B. Needleman and C.D. Wunsch, 'A general method applicable to the search for similarities in the amino acid sequence of two proteins', Journal of Molecular Biology, 48, (1970).
[11] T.F. Smith and M.S. Waterman, 'Identification of common molecular subsequences', Journal of Molecular Biology, 147, (1981).
[12] Giorgos Stoilos, Giorgos Stamou, et al., 'A string metric for ontology alignment', in Proceedings of the 4th International Semantic Web Conference (ISWC-2005), pp. 624–637, (2005).
[13] Y. Sure, O. Corcho, et al., eds., Proceedings of the 3rd Workshop on Evaluation of Ontology-based Tools (EON2004), 2004.
[14] William E. Winkler, 'The state of record linkage and current research problems', Technical report, U.S. Bureau of the Census, Statistical Research Division, Washington, DC, USA, (1999).