
FORPM: Boosting Users' Effect on Ontology Matching

Dunwei Wen (0, 1), dwwen@mail.csu.edu.cn

Xiaohu Fan (1), fanxiaohu17@gmail.com

Fuhua Lin (0)

(0) School of Computing and Information Systems, Athabasca University, Athabasca, Alberta T9S 3A3, Canada, fdunweiw

(1) School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China

In this paper, we view the ontology matching task from an information-gaining angle. In our view, the information used for matching comes mainly from the matching tools and from human experts. With this understanding, we believe that making good use of user effort can also accelerate the matching process. Hence we present a prototype system named FORPM. First, it ranks the entities of the ontology. Important entities are chosen as centroids to form fragments. Then, users can use the centroids' information to estimate the content of the fragments and match them initially. Finally, automatic matching is carried out among the matched fragments. Experimental results obtained so far show that, with a small amount of user effort, our approach significantly improves the matching efficiency while the loss of accuracy is acceptable.

Motivation

Ontology matching aims at finding semantic relationships between entities of different ontologies in order to solve the interoperation problem. From the viewpoint of information theory, we view a matching problem as a process of gaining information under uncertainty $\psi$. Let $\varphi$ denote the information obtained by the matching tool (from the ontology itself as well as from external sources such as WordNet), and let $\omega$ denote the information provided by users in the validation step (we share the opinion of [1] that fully automatic ontology matching is still impossible). To obtain matching results of high quality, we believe that the following condition has to be satisfied:

$$\omega + \varphi \geq \psi. \tag{1}$$

On the one hand, to the best of our knowledge, much recent research has focused on maximizing $\varphi$, the information obtained by the matching tool, and has made great progress. On the other hand, we believe that human users, especially domain experts, are capable of discovering complex relationships, such as more general ($\supseteq$) and less general ($\subseteq$), between candidate pairs. This extra information, however, is usually ignored, and human effort is used only in the validation step to judge simple relations such as matching or not matching. With this understanding, our work aims at making good use of the information provided by users, thus accelerating the matching process.

In this paper we propose: 1) an information-theory-based model for concept ranking and centroid extraction; 2) a clustering algorithm for ontology partitioning. To test our approach, we also introduce a prototype system called FORPM.

FORPM (Framework for Ontology Ranking, Partitioning and Matching)

FORPM is implemented in Java under JDK 5.0 and Eclipse 3.1.2. The system architecture is shown in Fig. 1. First, two ontologies are read in and transformed into DAGs (Directed Acyclic Graphs), where is-a relations become arcs and concepts become nodes. After the four main processing steps, shown inside the dashed line, the result and a reference-mapping file are sent to the evaluation module, which generates the evaluation results automatically and presents them to the user.

Step 1: Entity Ranking

Based on our observation, the amount of information $\omega$ provided by a user equals the sum of the amounts of information $I_i$ provided in the user's $T$ validations:

$$\omega = \sum_{i=1}^{T} I_i. \tag{2}$$

If we assume that every user validation has the same cost, then one intuitive way to improve the matching efficiency is to maximize each $I_i$. Fundamental to our ranking approach, therefore, is the ability to measure how much information is conveyed by a node, which gives a sense of how much information the computer would gain by being informed about a discovered matching pair.

In information theory [2], the amount of information contained in an event is measured by the negative logarithm of the probability of occurrence of the event. Thus, if $\chi$ is an event that has possible outcome values $x_1, x_2, \ldots, x_n$ occurring with probabilities $pr_1, pr_2, \ldots, pr_n$, the amount of information gained, or uncertainty removed, by knowing that $\chi$ has the outcome $x_i$ is given by:

$$I(\chi = x_i) = -\log(pr_i). \tag{3}$$

Based on this, we can build a model to measure the amount of information of a node in an ontology graph by considering the concept as an event and its is-a relations as the outcomes. Let $G(Arcs, Nodes)$ be the ontology graph, where $Arcs$ is the set of all is-a relations and $Nodes$ is the set of all concepts in ontology $O$. Then for any arc $a_i \in Arcs$, its probability is given by

$$\Pr(\chi = a_i) = \frac{1}{|arcs_i|}. \tag{4}$$

Here $|arcs_i|$ is the number of arcs connecting with node $n_i$. Thus the amount of information contained in arc $a_i$ is:

$$I(\chi = a_i) = -\log \Pr(\chi = a_i). \tag{5}$$

If we have a node $n_i \in Nodes$, then the amount of information contained in $n_i$ is:

$$I(n_i) = \sum_{a_i \in arcs_i} I(a_i). \tag{6}$$

We use the amount of information to rank nodes. The node that contains the largest amount of information is defined as an information center.
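The ranking model can be illustrated with a minimal sketch. It assumes that arcs are given as (child, parent) pairs, that every arc touching a node counts toward $|arcs_i|$, and that the logarithm is taken to base 2; the paper does not fix the base. The function names and the toy hierarchy are hypothetical and are not taken from FORPM.

```python
import math
from collections import defaultdict


def node_information(arcs):
    """Return I(n) for every node, given is-a arcs as (child, parent) pairs."""
    degree = defaultdict(int)
    for child, parent in arcs:
        degree[child] += 1
        degree[parent] += 1
    # Each of the k arcs at a node has probability 1/k and therefore carries
    # -log2(1/k) = log2(k) bits; summing over the k arcs gives k * log2(k).
    return {node: k * math.log2(k) for node, k in degree.items()}


def information_center(arcs):
    """Return the node with the largest amount of information."""
    info = node_information(arcs)
    return max(info, key=info.get)


# Hypothetical toy hierarchy (assumption, for illustration only).
arcs = [("Car", "Vehicle"), ("Bus", "Vehicle"), ("Bike", "Vehicle"), ("Sedan", "Car")]
info = node_information(arcs)
center = information_center(arcs)
```

Under these assumptions a leaf concept with a single arc carries no information, so well-connected concepts rise to the top of the ranking.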

Definition 1 (Information Center/Centroid Node). Let $G(N, A)$ be an ontology graph, where $N$ is the node set and $A$ is the arc set. Then node $I_c \in N$ is an information center if for any node $n_i \in N$:

$$I(I_c) \geq I(n_i). \tag{7}$$

Step 2: IFC (Information Flooding theory based Clustering)

The goal of this step is to form fragments from centroids. We noticed that in the is-a hierarchy tree, the semantic similarity between two concepts often decays as the distance between them increases. In our work, we define an information flooding function to measure how strongly a source node can affect a target node.

Definition 2 (Information Flood). Let $G(N, A)$ be an ontology graph and $n_i, n_j \in N$. We define the information flood from $n_i$ to $n_j$ as:

$$InfoFlood(n_i, n_j) = F(Dist_{ij}) \cdot I(n_i), \tag{8}$$

where $I(n_i)$ is the information contained in $n_i$, and $F(Dist_{ij})$ is a quadratic empirical decay function that simulates the attenuation of similarity, defined as

$$F(Dist_{ij}) = \frac{1}{a \cdot Dist_{ij}^2 + b \cdot Dist_{ij} + c}, \tag{9}$$

and $Dist_{ij}$ is the number of arcs between node $n_i$ and node $n_j$ in the is-a relation hierarchy tree. In our experiment, we set $a = 0.25$, $b = 0.5$, $c = 0$.
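A short sketch of the decay function and the information flood may help to see the numbers involved. It assumes the reciprocal-quadratic form of the decay function with the parameter values reported above, and it raises an error where the denominator is zero (distance 0 with c = 0), a case the definition leaves open. The function names are hypothetical.

```python
def decay(dist, a=0.25, b=0.5, c=0.0):
    """Quadratic decay F(Dist) = 1 / (a*Dist^2 + b*Dist + c)."""
    denominator = a * dist ** 2 + b * dist + c
    if denominator == 0:
        # With c = 0 the function is undefined at distance 0 (assumption:
        # callers treat a node's flood onto itself separately).
        raise ValueError("decay is undefined for this distance")
    return 1.0 / denominator


def info_flood(source_information, dist):
    """InfoFlood from a source node with information I(n_i) at a given distance."""
    return decay(dist) * source_information


# Decay factors at distances 1, 2 and 3 for the default parameters.
factors = [decay(d) for d in (1, 2, 3)]
```

With these parameters the factor drops from 4/3 at distance 1 to 1/2 at distance 2, so a centroid's influence fades quickly as one moves away from it in the hierarchy.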

Definition 3 (Fragment). Let $O$ be an ontology and $G_i(N, A)$ be a graph representing part of $O$, where $N$ is the node set and $A$ is the arc set. Let $dList$ be the set of all centroid nodes in $O$. If there is a $d_i \in dList$ such that, for all $n_k \in N$ and any $d_j \in dList$,

$$InfoFlood(d_i, n_k) \geq \max_{d_j \in dList} InfoFlood(d_j, n_k), \tag{10}$$

then we say $G_i$ is a fragment $f(d_i, m_i)$ with $m_i$ as its size and $d_i$ as its centroid node.
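The assignment rule behind this definition can be sketched as follows: every non-centroid node joins the centroid whose information flood onto it is largest. This is only an illustration of the rule, not the partitioning algorithm of Table 1; it ignores the Max and Min bounds and the merge step. Distances are computed by breadth-first search over the undirected is-a graph, and the centroid information values, node names, and function names are hypothetical.

```python
from collections import defaultdict, deque


def bfs_distances(adjacency, source):
    """Number of arcs from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def assign_fragments(arcs, centroid_info, a=0.25, b=0.5, c=0.0):
    """Group nodes around the centroid with the largest InfoFlood onto them."""
    adjacency = defaultdict(set)
    for u, v in arcs:
        adjacency[u].add(v)
        adjacency[v].add(u)
    distances = {d: bfs_distances(adjacency, d) for d in centroid_info}
    fragments = {d: [d] for d in centroid_info}
    for node in sorted(adjacency):
        if node in centroid_info:
            continue
        best, best_flood = None, 0.0
        for d in sorted(centroid_info):
            if node not in distances[d]:
                continue  # unreachable from this centroid
            k = distances[d][node]
            flood = centroid_info[d] / (a * k * k + b * k + c)
            if flood > best_flood:
                best, best_flood = d, flood
        if best is not None:
            fragments[best].append(node)
    return fragments


# Hypothetical hierarchy and centroid information values (assumptions).
arcs = [("Car", "Vehicle"), ("Bus", "Vehicle"), ("Bike", "Vehicle"),
        ("Vehicle", "Thing"), ("Animal", "Thing"), ("Dog", "Animal")]
fragments = assign_fragments(arcs, {"Vehicle": 8.0, "Animal": 2.0})
```

In this toy case "Dog" stays with the weaker but closer centroid "Animal", while "Thing", which is equally close to both, is pulled toward the more informative centroid "Vehicle".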

We briefly describe the partitioning algorithm in Table 1. The algorithm receives two parameters: Max, the upper bound on the number of fragments, and Min, the lower bound on the size of a fragment. We set a maximum number of iterations to ensure that the algorithm terminates. A merge algorithm is also implemented to deal with fragments whose size is below the lower bound.

Step 3: Manual Matching.

In this step, users use the centroid nodes to estimate the content of the fragments and then match them manually. A centroid node may have more than one counterpart, with relations such as equivalence ($=$), more general ($\supseteq$), less general ($\subseteq$), mismatch ($\perp$) and overlapping ($\cap$). Two centroids are considered semantically matched [3] if the relation between them is not mismatch.

Step 4: Automatic Matching.

Two fragments are viewed as matched if their centroids are semantically matched. The remaining matching work between two matched fragments is the same as a normal matching task, but of smaller size. Various approaches could be adopted to finish the task. In FORPM, we employ the same string-based technique as in [4] for demonstration.
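The following sketch shows why restricting automatic matching to matched fragments reduces the number of comparisons. It is not the string-based technique of [4]: as a stand-in, it uses the similarity ratio of Python's standard `difflib.SequenceMatcher` with an arbitrary threshold of 0.8. The fragment contents, the matched centroid pairs, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_fragments(frags_a, frags_b, matched_centroids, threshold=0.8):
    """Compare labels only inside fragment pairs whose centroids were matched."""
    pairs, comparisons = [], 0
    for centroid_a, centroid_b in matched_centroids:
        for x in frags_a[centroid_a]:
            for y in frags_b[centroid_b]:
                comparisons += 1
                if label_similarity(x, y) >= threshold:
                    pairs.append((x, y))
    return pairs, comparisons


# Hypothetical fragments of two small ontologies (assumptions).
frags_a = {"Vehicle": ["Vehicle", "Car", "Bus"], "Animal": ["Animal", "Dog"]}
frags_b = {"Vehicles": ["Vehicles", "Car", "Coach"], "Fauna": ["Fauna", "Dog"]}
# Centroid pairs a user has marked as semantically matched in Step 3.
matched = [("Vehicle", "Vehicles"), ("Animal", "Fauna")]

pairs, comparisons = match_fragments(frags_a, frags_b, matched)
```

Here 13 label comparisons are made instead of the 25 that a full cross-comparison of the two five-concept ontologies would need. A purely string-based measure also misses pairs such as "Bus" and "Coach", which is one reason the user's judgment on centroids carries extra information.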

Evaluation and Discussion

In our experiment, we adopted a dataset from [4]. Russia1a contains 151 concepts, while Russia1b contains 162 concepts, with 64 human-confirmed mappings (concepts only). We used F-Measure (cf. [4]) and Cost (see below) as quality metrics.

$$Cost = \frac{\#CompareTimes}{\#FoundMatchedPairs}. \tag{11}$$
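Both metrics are simple to compute, as the sketch below suggests. It assumes the usual balanced F-Measure (the harmonic mean of precision and recall) and represents mappings as sets of concept-name pairs; the function names and the sample pairs are hypothetical. The final lines are plain arithmetic on the dataset sizes, not an experimental result: they give the Cost that an exhaustive comparison of Russia1a and Russia1b would have if it found all 64 reference mappings.

```python
def f_measure(found, reference):
    """Balanced F-Measure of a set of found pairs against reference pairs."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    if correct == 0:
        return 0.0
    precision = correct / len(found)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)


def cost(compare_times, found_matched_pairs):
    """Cost = number of comparisons per matched pair found."""
    if found_matched_pairs == 0:
        raise ValueError("no matched pairs found, Cost is undefined")
    return compare_times / found_matched_pairs


# Hypothetical mappings (assumptions, for illustration only).
found = {("a", "b"), ("c", "d"), ("e", "f")}
reference = {("a", "b"), ("c", "d"), ("g", "h"), ("i", "j")}
score = f_measure(found, reference)  # precision 2/3, recall 1/2

# Arithmetic baseline: every concept of Russia1a compared with every concept
# of Russia1b, assuming all 64 reference mappings are found.
naive_comparisons = 151 * 162
naive_cost = cost(naive_comparisons, 64)
```

Any partitioning that keeps most of the true pairs inside matched fragments lowers the numerator of Cost while leaving the denominator nearly unchanged.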

In FORPM, users can tune the system by changing the values of Max and Min in Step 2. We can see in Fig. 2 that the more fragments there are, the more likely users are to make right judgments, which lowers the cost. Meanwhile, more human work is required (the user has to perform at most Max*Max validations). In our experience, the program performs best when Max is set between 5% and 10% of the total number of concepts.

[Fig. 2: F-Measure (vertical axis, 0 to 1) plotted against Max (horizontal axis).]

Concluding Remarks

In this paper, we have proposed a framework for ontology matching based on information-gain theory. We have shown that, with little user effort, our approach is effective in reducing the matching complexity.

Our work is inspired by data mining technology. Our information model draws on [5]. The implementation of our tool refers to the work of [6], while [7] proposes an automatic block-based matching approach. Both [7] and our tool employ a ranking step to label the blocks. However, the ranking step of [7] comes after the partitioning step, whereas in FORPM the ranking step is carried out first, since we employ an extraction-like clustering algorithm.

Acknowledgements

We thank the Natural Sciences and Engineering Research Council of Canada and the Academic Research Council of Athabasca University for their financial support for this research, and we thank the anonymous reviewers for their detailed and constructive comments.

1. N.F. Noy and M.A. Musen: The PROMPT suite: interactive tools for ontology merging and mapping. International Journal of Human-Computer Studies, 59(6), pp. 983-1024, 2003.

2. C.E. Shannon: A Mathematical Theory of Communication. The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948.

3. Paolo Avesani, Fausto Giunchiglia, and Mikalai Yatskevich: A Large Scale Taxonomy Mapping Evaluation. ISWC 2005, LNCS 3729, pp. 67-81, 2005.

4. Marc Ehrig and Steffen Staab: QOM - Quick Ontology Mapping. In: The Semantic Web, Proceedings ISWC'04, pp. 289-303, 2004. Springer-Verlag LNCS 3298.

5. Kemafor Anyanwu, Angela Maduko, and Amit Sheth: SemRank: Ranking Complex Relationship Search Results on the Semantic Web. In: WWW 2005, May 10-14, 2005, Chiba, Japan.

6. Heiner Stuckenschmidt and Michel Klein: Structure-Based Partitioning of Large Concept Hierarchies. In: The Semantic Web, Proceedings ISWC'04, pp. 289-303, 2004.

7. Hu, Zhao, Y. Qu: Partition-Based Block Matching of Large Class Hierarchies. In: Proceedings of ASWC 2006, Beijing, China, 2006.