
Towards a Benchmark for Instance Matching

Alfio Ferrara

Davide Lorusso

Stefano Montanelli

Gaia Varese

Università degli Studi di Milano, DICo, 10235 Milano, Italy

vareseg@dico.unimi.it

In the general field of knowledge interoperability and ontology matching, instance matching is a crucial task for several applications, from identity recognition to data integration. The aim of instance matching is to detect instances that refer to the same real-world object despite the differences among their descriptions. Algorithms and techniques for instance matching have been proposed in the literature; however, the problem of their evaluation is still open. Furthermore, a widely recognized problem in the Semantic Web in general is the lack of evaluation data. While OAEI (Ontology Alignment Evaluation Initiative) has provided a reference benchmark for concept matching, evaluation data for instance matching are still scarce. In this paper, we provide a benchmark for instance matching, with the goal of taking into account the main requirements that instance matching algorithms should address.

1 Introduction

The increasing popularity of Semantic Web technologies makes the ontology matching process a crucial task. The aim of ontology matching [1] is to (semi-)automatically detect semantic correspondences between heterogeneous ontologies. It can be performed at two different levels: schema matching and instance matching. The objective of schema matching [2] is to find a set of mappings between concepts and properties in different ontologies, while the aim of instance matching is to detect instances that refer to the same real-world object. When comparing different knowledge representations, the schemas of the ontologies should first be merged, in terms of the concepts and properties describing the domain. Then, mappings between different descriptions (i.e., ontology instances) of the same object should be discovered, in order to achieve the goal of providing a data integration system over Semantic Web sources.

Instance matching is also crucial in projects like OKKAM [3], where the main idea is that descriptions of real-world objects can be retrieved, uniquely identified and shared over the Web.

Most research has focused on schema-level matching, while the instance matching problem has been mainly studied in the database field, in which it is more specifically called the record linkage problem [4–6]. However, as shown in this paper, instance matching brings new problems in comparison to record linkage and requires specific techniques.

2 The Instance Matching Problem

The instance matching problem is defined as follows. Given two instances $i_1$ and $i_2$, belonging to the same ontology or to different ontologies, instance matching is defined as a function $Im(i_1, i_2) \rightarrow \{0, 1\}$, where 1 denotes the fact that $i_1$ and $i_2$ refer to the same real-world object and 0 denotes the fact that $i_1$ and $i_2$ refer to different objects.

In order to determine correctly whether two individuals refer to the same real-world object, an instance matching algorithm should satisfy different kinds of requirements. As shown in Figure 1, these can be divided into three main categories.

[Figure 1: Requirements (management of):]

- Data value differences
  - Typographical errors
  - Use of different standard formats
- Structural heterogeneity
  - Use of different levels of depth for properties representation
  - Use of different aggregation criteria for properties representation
  - Missing values specification
- Logical heterogeneity
  - Instantiation on different subclasses of the same superclass
  - Instantiation on disjoint classes
  - Instantiation on different classes of a class hierarchy explicitly declared
  - Instantiation on different classes of a class hierarchy implicitly declared
  - Implicit values specification

Data value differences. An instance matching algorithm is required to recognize corresponding values as accurately as possible, even if the data contain errors or are represented using different standard formats. This issue has been addressed in the field of record linkage research, and the problem of comparing the property values of instances is the same as comparing the attribute values of records.

Structural heterogeneity. Instances belonging to different ontologies can differ not only in their property values, but also in their structure. While in record linkage the structure of records is usually given and schema matching and record matching are different problems, in instance matching schema and instances are more strictly related. Thus, besides the capability to evaluate the level of similarity between property values, instance matching techniques have to cope with heterogeneous individual representations by identifying the pairs of matching properties between the two instances considered.

Logical heterogeneity. A problem specific to ontology matching, which is not taken into consideration in the record linkage process, is the need to infer implicit knowledge, typically related to the concept hierarchy within the ontologies.

3 Design of a Benchmark for Instance Matching

A widely recognized problem in the Semantic Web is the lack of evaluation data. While OAEI (Ontology Alignment Evaluation Initiative)² [7] has provided a reference benchmark for concept matching, evaluation data for instance matching are still scarce. Further works dealing with concept matching evaluation were published at ESWC 2008 [8, 9]. In particular, they argue that ontology matching techniques cannot be evaluated in an application-independent way, since the same matching technique can produce results of different quality depending on the end-to-end application that exploits the alignments.

In this paper, we provide a benchmark for instance matching. The aim of our benchmark is to take into account all the main requirements presented in the previous section and to provide a complete set of tests for the evaluation of instance matching algorithms. A contribution of our work is not only the definition of a specific benchmark, but also the definition of a semi-automatic procedure for the generation of several different benchmarks. The overall process of benchmark generation is shown in Figure 2. As an example of this general procedure, we describe in the following a specific instantiation of it, that is, the creation of a specific benchmark for instance matching. That benchmark is available at http://islab.dico.unimi.it/iimb/.

3.1 Reference ABox Generation

First of all, we chose a domain of interest (i.e., the domain of movie data), and we created a reference (ALCF(D)) TBox for it, based on our knowledge of the domain. The reference TBox is available at http://islab.dico.unimi.it/ontologies/benchmark/imdbT.owl. It contains 15 named classes, 5 object properties and 13 datatype properties. The reference TBox is then populated by automatically creating a reference ABox. Data are extracted from IMDb³ by executing a query parameterized by X, where X is a variable specifying a word of our choice. Thus, all selected movies contain the word X in their title. The corresponding individuals in the reference ABox refer to similar objects, but each of them represents a distinct object in the real world. As a consequence, each instance can be uniquely identified. In order to get our reference ABox, we set X = Scarface. The reference ABox obtained in this way contains 302 individuals, that is, all the movie objects matching the query and all the actors in the cast of those movies.

[Figure 2: benchmark generation process. Input: a user query over IMDb. The POPULATION step produces the reference ABox; the MODIFIER step takes the reference ABox and produces as output Modified ABox 1, Modified ABox 2, ..., Modified ABox n.]

Once the reference ABox is created, we generate a set of modified ABoxes, each consisting of a collection of instances obtained by modifying the corresponding instances in the reference ABox. The transformations introduced in the benchmark ABoxes can be divided into three main categories. In particular, each modification category simulates a specific problem that can be found when comparing ontology instances, that is, one of the issues discussed in Section 2. Modifications belonging to different categories are also combined within the same ABox.

² http://oaei.ontologymatching.org/2007/benchmarks/
³ http://www.imdb.com/

4 Generating Instance Modifications

In this section, we describe the Modifier module of our benchmark generation procedure, that is, the way the modified ABoxes of a benchmark are generated. Given the reference ABox as input, and a user specification of all the transformations to apply to it, the Modifier module automatically produces the corresponding modified ABoxes. In the following, all the modifications that can be applied to the reference ABox are presented.

4.1 Data Value Differences

The goal of this first category of modifications is to simulate the differences that can be found, at the property value level, between instances that refer to the same object. These include typographical errors, the use of different standard formats to represent the same value, or a combination of both within the same value.

Typographical errors. Real data are often dirty. This is mainly due to typographical errors made by humans while storing data.

In order to simulate typographical errors, we use a function that takes as input a datatype property value and produces as output a modified value. This kind of transformation can be applied to each datatype property value (e.g., string value, integer value, date value). The modifications to apply to the input value are randomly chosen among the following:

- Insert character. A random character (or a random number, if the property has a numerical value) is inserted in the input value at a random position.
- Modify character. A random character (or a random number, if the property has a numerical value) is modified in the input value.
- Delete character. A random character (or a random number, if the property has a numerical value) is deleted in the input value.
- Exchange characters' position. The position of two adjacent characters (or two adjacent numbers, if the property has a numerical value) is exchanged in the input value.

For example, the movie title "Scarface" can be transformed into the modified value "Scrface", obtained by deleting a random character from the original string. In addition, it is possible to specify the level of severity (i.e., low, medium or high) with which such transformations are applied. In any case, the number of transformations introduced in the input value is proportional to the length of the value. If the number of transformations to apply is greater than one, the value can be modified by combining different transformations.
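The four transformations and the severity levels described above can be sketched as follows. This is only an illustrative sketch, not the Modifier module itself: the function names `apply_typo` and `corrupt`, the severity rates in `SEVERITY_RATE`, and the choice of lowercase letters as the replacement alphabet are assumptions, and date values are not treated specially.

```python
import random
import string

# Assumed fraction of the value length that is modified at each severity level.
SEVERITY_RATE = {"low": 0.1, "medium": 0.2, "high": 0.3}


def apply_typo(value: str, rng: random.Random) -> str:
    """Apply one randomly chosen typographical error to a property value."""
    # Numerical values receive random digits, other values random letters.
    alphabet = string.digits if value.isdigit() else string.ascii_lowercase
    ops = ["insert"]
    if len(value) >= 1:
        ops += ["modify", "delete"]
    if len(value) >= 2:
        ops.append("swap")
    op = rng.choice(ops)
    if op == "insert":
        pos = rng.randrange(len(value) + 1)
        return value[:pos] + rng.choice(alphabet) + value[pos:]
    if op == "modify":
        pos = rng.randrange(len(value))
        # Pick a replacement that differs from the current character.
        replacement = rng.choice([c for c in alphabet if c != value[pos]])
        return value[:pos] + replacement + value[pos + 1:]
    if op == "delete":
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]
    # Exchange the position of two adjacent characters.
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def corrupt(value: str, severity: str = "low", seed=None) -> str:
    """Apply a number of typos proportional to the length of the value."""
    rng = random.Random(seed)
    count = max(1, round(SEVERITY_RATE[severity] * len(value)))
    for _ in range(count):
        value = apply_typo(value, rng)
    return value
```

For instance, `corrupt("Scarface", "low", seed=7)` applies a single random transformation to the title, while a higher severity applies several transformations in sequence, possibly of different kinds.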

Typographical modifications can be applied to "identifying properties", "non-identifying properties" or both. This classification is based on the analysis of the percentage of null and distinct values specified for the selected property. In particular, properties with a high percentage of distinct values and a low percentage of null values are classified as the most identifying.
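The classification of properties can be illustrated with a small sketch. The thresholds `max_null` and `min_distinct` are assumptions chosen for the example (the text does not state the values used), and the sample data are made up.

```python
def property_profile(values):
    """Return (null_ratio, distinct_ratio) for one property over all instances."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_ratio = (total - len(non_null)) / total if total else 1.0
    distinct_ratio = len(set(non_null)) / len(non_null) if non_null else 0.0
    return null_ratio, distinct_ratio


def is_identifying(values, max_null=0.1, min_distinct=0.9):
    """A property is identifying if it is rarely null and almost always distinct."""
    null_ratio, distinct_ratio = property_profile(values)
    return null_ratio <= max_null and distinct_ratio >= min_distinct


# Hypothetical values of two properties over four instances.
titles = ["Scarface", "Scarface Nation", "Scarface Mob", "Mr. Scarface"]
genders = ["M", "F", "M", None]
print(is_identifying(titles), is_identifying(genders))
```

With these thresholds, the title property is classified as identifying, while the gender property is not, since it has few distinct values and a missing value.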

Of course, the total amount of modifications applied to each modified ABox must change the reference ABox in such a way that it is still reasonable to consider the two ABoxes semantically equivalent. In other words, a modified ABox is included in the benchmark only if a human can understand that its instances refer to the same real-world objects as the ones belonging to the reference ABox. Thus, in order to evaluate the distance between the reference ABox and each modified ABox, we introduce a measure that takes into account the number of modifications applied to the ABox, the kind of properties that have been modified (i.e., "identifying properties" or "non-identifying properties"), and the level of severity of the modifications (i.e., low, medium or high). However, this measure does not affect the instance matching results in a deterministic way, since they depend on the weight that the tested algorithm gives to each kind of modification. In any case, we assume that a modified ABox can be considered semantically equivalent to the reference ABox only if it changes no more than 20% of each instance description.

Use of different standard formats. The same data within different sources can be represented in different ways.

In order to simulate the use of different standards within different sources, we use a function that takes as input a property value which allows standard modifications (e.g., a person name) and produces as output a modified value, using a different standard format. For example, the director name "De Palma, Brian" can be transformed into the modified value "Brian De Palma", which is another standard format to specify a person name.
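As a minimal sketch of such a format change for person names, the following function rewrites a "Surname, Given name" value as "Given name Surname". The function name is hypothetical, and only this one pair of formats is covered.

```python
def surname_first_to_given_first(name: str) -> str:
    """Rewrite 'Surname, Given' as 'Given Surname'; leave other values unchanged."""
    surname, separator, given = name.partition(",")
    if not separator:
        # No comma: the value is not in the 'Surname, Given' format.
        return name
    return f"{given.strip()} {surname.strip()}"


print(surname_first_to_given_first("De Palma, Brian"))  # Brian De Palma
```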

4.2 Structural Heterogeneity

Another kind of situation simulated in our instance matching benchmark is the comparison between instances with different schemas. In fact, even assuming that concept mappings are available, the same individual feature (i.e., each instance property) can be modeled in different ways. Moreover, different descriptions of the same real-world object can specify different subsets, possibly empty, of all the possible values for that property. Combinations of different transformations belonging to this class of modifications are also applied in the benchmark.

Use of different levels of depth for properties representation. A first example of this class of heterogeneity is shown in Figure 3. The two instances movie_1 and movie_2 both refer to the same film, but the movie title property is modeled in two different ways. In fact, the title of movie_1 is specified directly through a datatype property value, while the title of movie_2 is specified through a reference to another individual which has a property with the same title value (i.e., "Scarface"). In particular, in the first representation, the property HasTitle is a datatype property, while in the second one it is an object property and its value is the reference to the title_1 instance. In order to simulate the comparison between instances with different schemas, we use a function that takes as input a datatype property and produces as output an object property with the same name. Moreover, the function creates a new individual as the value of the generated object property, with an attribute whose value is the same as that of the original datatype property.
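The transformation can be sketched over a simple dictionary-based representation of an ABox. This representation, the function name `deepen_property`, the `{"ref": ...}` encoding of object property values, and the attribute name `value` of the new individual are all assumptions made for illustration.

```python
def deepen_property(instances, instance_id, prop, new_id, inner_prop="value"):
    """Turn a datatype property into an object property pointing to a new individual."""
    # Copy each instance description so that the reference ABox is left untouched.
    result = {iid: dict(props) for iid, props in instances.items()}
    literal = result[instance_id][prop]
    # The new individual carries the original literal value.
    result[new_id] = {inner_prop: literal}
    # The property now refers to the new individual instead of the literal.
    result[instance_id][prop] = {"ref": new_id}
    return result


reference = {"movie_1": {"HasTitle": "Scarface"}}
modified = deepen_property(reference, "movie_1", "HasTitle", "title_1")
print(modified)
```

In the modified description, `HasTitle` points to `title_1`, and the string "Scarface" is reachable only by following that reference, which is what a matcher has to do to compare the two instances.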

Use of different aggregation criteria for properties representation. In an analogous way, the name of a person can be stored entirely within the same property, or it can be split into different properties such as, for example, Name and Surname. Figure 4 shows two different ways of modeling the name "Pacino, Al". In the first representation the whole value is stored within the property Name, while in the second one the string is split into the two values "Al" and "Pacino", assigned to the properties Name and Surname respectively. In order to simulate the comparison between properties modeled in different ways, we use a function that takes as input a datatype property value that can be split and produces as output two new datatype properties, each specifying a different part of the original value.

[Figure 4: two descriptions of the same actor. actor_1: Name "Pacino, Al", Gender M, Nickname Sonny, DateOfBirth 1940-04-25. actor_2: Surname Pacino, Name Al, Gender M, Nickname Sonny, DateOfBirth 1940-04-25.]
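A possible sketch of the splitting function, applied to the actor description of Figure 4, is given below. The dictionary representation and the function name `split_name` are assumptions; the comma is assumed to be the separator between surname and given name.

```python
def split_name(instance, prop="Name", parts=("Surname", "Name"), sep=","):
    """Split one datatype property value into two new datatype properties."""
    value = instance.get(prop)
    if value is None or sep not in value:
        # Nothing to split: return an unchanged copy.
        return dict(instance)
    first, second = (part.strip() for part in value.split(sep, 1))
    modified = {k: v for k, v in instance.items() if k != prop}
    modified[parts[0]] = first
    modified[parts[1]] = second
    return modified


actor_1 = {
    "Name": "Pacino, Al",
    "Gender": "M",
    "Nickname": "Sonny",
    "DateOfBirth": "1940-04-25",
}
actor_2 = split_name(actor_1)
print(actor_2["Surname"], actor_2["Name"])  # Pacino Al
```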

Missing values specification. A further example of structural heterogeneity is shown in Figure 5. The two instances movie_1 and movie_2 both refer to the same film, but the two descriptions specify different subsets of values for the property Genre.

In order to simulate the comparison between different sets of values for the same property, we use a function that takes as input the set of values specified for a selected property and produces as output a subset of it, possibly empty. This kind of transformation can be applied to each property. Moreover, if a property allows multiple values, it is possible to specify whether to delete all the values of the selected property or a random number of them.
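The value deletion can be sketched as follows. The function name `drop_values`, its parameters, and the genre values in the example are hypothetical; in this sketch at least one value is always deleted, so that the output differs from the input.

```python
import random


def drop_values(values, delete_all=False, rng=None):
    """Return a (possibly empty) proper subset of the values of a property."""
    rng = rng or random.Random()
    values = list(values)
    if delete_all or not values:
        return []
    # Delete a random number of values, at least one.
    n_delete = rng.randint(1, len(values))
    return rng.sample(values, len(values) - n_delete)


genres = ["Crime", "Drama", "Thriller"]  # made-up example values
print(drop_values(genres, rng=random.Random(3)))
print(drop_values(genres, delete_all=True))  # []
```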

[Figure 5: two descriptions, movie_1 and movie_2, of the same film, with the properties HasTitle (Scarface), Genre, and Year (1983); the two descriptions specify different subsets of Genre values.]

4.3 Logical Heterogeneity

Finally, the instance matching process should take into account the need to use some kind of reasoning, in order to correctly identify the instances to be compared. In fact, individuals referring to the same entity can be instantiated in different ways within different ontologies. In the following, we describe the five kinds of situations that we simulate in our benchmark, which can also be combined together. Each requires some kind of reasoning. Examples are shown in Figure 6.

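[Figure 6: examples of logical heterogeneity. TBox axioms shown in the figure: $Movie \sqcap Product \sqsubseteq \bot$; $SubG \sqsubseteq G$; the restriction $\forall p.G$ on Movie and the restriction $\forall p.SubG$ on SubM.]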

Instantiation on different subclasses of the same superclass. This transformation is obtained by instantiating identical individuals into different subclasses of the same class. For example, in our benchmark, all the movie objects are instances of the class Movie in the reference ABox. In one of the modified ABoxes, instead, we change the type of those individuals, making them instances of the class Film. The classes Movie and Film are both subclasses of Item. In Figure 6, movie_1 is an instance of Movie in the reference ABox, while it is an instance of Film in the modified ABox. Instance matching algorithms are thus required to recognize that those two instances refer to the same object, even if they belong to different concepts.

Instantiation on disjoint classes. This transformation is obtained by instantiating identical individuals into disjoint classes. For example, in one of the modified ABoxes, we change the type of all the movie objects, making them instances of the class Product. The classes Movie and Product are defined as disjoint classes in the reference TBox. In Figure 6, movie_2 is an instance of Movie in the reference ABox, while it is an instance of Product in the modified ABox. In this case, the tested algorithms are expected to recognize that instances belonging to disjoint classes cannot refer to the same real-world object, even if they seem identical.

Instantiation on different classes of a class hierarchy explicitly declared. This transformation is obtained by instantiating identical individuals into different classes on which an explicit class hierarchy is defined. For example, an individual representing a movie can be classified as an instance of the general concept Movie, as it is in the reference ABox, or it can be classified as an instance of a more specific subclass of it, such as Action, Biography, Comedy or Drama, depending on the value that the movie instance specifies for the property Genre. In Figure 6, movie_3 is an instance of Movie in the reference ABox, while it is an instance of its subclass Action in the modified ABox, since it is an action movie. Instance matching algorithms are thus required to recognize that those two instances refer to the same object, even if they belong to different concepts within the class hierarchy. This explicit class hierarchy declaration can be recognized using an RDFS reasoner.
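The kind of inference needed for an explicitly declared hierarchy can be illustrated with a transitive closure over subclass statements. This is a hand-written sketch, not an RDFS reasoner: the dictionary `SUBCLASS_OF` only encodes the subclass relations mentioned in the text, and the function names are hypothetical. It checks only whether one class is an ancestor of the other, so it does not cover sibling classes such as Movie and Film.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from cls via explicit subclass statements (cls included)."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def hierarchy_compatible(c1, c2, subclass_of):
    """True if one class is equal to, or a (transitive) superclass of, the other."""
    return c1 in superclasses(c2, subclass_of) or c2 in superclasses(c1, subclass_of)


# Explicit subclass statements taken from the examples in the text.
SUBCLASS_OF = {
    "Action": ["Movie"],
    "Biography": ["Movie"],
    "Comedy": ["Movie"],
    "Drama": ["Movie"],
    "Movie": ["Item"],
    "Film": ["Item"],
}

print(hierarchy_compatible("Movie", "Action", SUBCLASS_OF))  # True
```

Under this check, an instance typed as Action in the modified ABox remains a candidate match for an instance typed as Movie in the reference ABox.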

Instantiation on different classes of a class hierarchy implicitly declared. A further modification that we apply in the benchmark is the instantiation of identical individuals into different classes on which an implicit class hierarchy is defined. Such an implicit class hierarchy declaration can be obtained through the use of restrictions. For example, the restrictions specified on the classes Movie and SubM in the reference TBox implicitly declare that SubM is a subclass of Movie. In Figure 6, movie_4 is an instance of Movie in the reference ABox, while it is an instance of SubM in the modified ABox. Instance matching algorithms are thus required to recognize that those two instances refer to the same object, even if they belong to different concepts which are not explicitly related. This implicit class hierarchy declaration can be recognized using a DL reasoner.

Implicit values specification. Another use of restrictions that requires a reasoning process is the comparison between an explicitly specified value and an implicitly specified one, that is, a value specified using a hasValue restriction. This kind of situation is simulated in our benchmark by adding a new type to each instance of the modified ABox. This type is a class that (implicitly) specifies property values through a hasValue restriction. In Figure 6, in the reference ABox, movie_5 is an instance of Movie and its value for the property HasTitle is "Scarface"; in the modified ABox, movie_5 is also an instance of Movie, but it is in addition an instance of the restriction class that implicitly specifies the value "Scarface" for its HasTitle property. Instance matching algorithms are thus required to recognize that those two instances refer to the same object, even if some property values of the modified instance are implicitly defined.

5 Benchmark at Work

In this section, we describe how the generated benchmark is used to evaluate instance matching algorithms. Each execution of the evaluation process takes as input a pair of ABoxes, that is, the reference ABox and one of the modified ABoxes, and produces the set of instance mappings found by the tested algorithm. The output alignment is then compared with the expected one, which is provided together with each modified ABox. That reference alignment is automatically generated by specifying a mapping for each pair of corresponding instances, that is, the instance belonging to the reference ABox and the one obtained by applying to it one or more of the modifications discussed in Section 4. Instance matching algorithms are evaluated according to the following parameters.

- Precision: the number of correct retrieved mappings / the number of retrieved mappings.
- Recall: the number of correct retrieved mappings / the number of expected mappings.
- F-measure: $2 \cdot (precision \cdot recall) / (precision + recall)$.
- Fall-out: the number of incorrect retrieved mappings / the number of non-expected mappings.
- Execution time: time taken by the tested algorithm to compare the two input ABoxes. This parameter measures how well the tested algorithm scales.

As an example, the results obtained by two instance matching algorithms are reported. Figure 7 shows the precision and recall of the two instance matching algorithms over the generated benchmark, distinguishing the results obtained in the three main classes of problems simulated in our benchmark (i.e., data value differences, structural heterogeneity, logical heterogeneity), as well as the results obtained by executing each algorithm without any reasoner and with a DL reasoner (i.e., Pellet). The results obtained by comparing the reference ABox with modified ABoxes simulating data value differences are higher than the ones obtained in the other categories, since string matching techniques are well established. The results obtained by comparing the reference ABox with modified ABoxes simulating structural heterogeneity are not very high, because neither the first nor the second algorithm can manage the use of different aggregation criteria for properties representation. The results obtained by comparing the reference ABox with modified ABoxes simulating logical heterogeneity are greatly affected by the use of a reasoner.

[Figure 7: two charts with values ranging from 0,00 to 1,00 over the categories data value differences, structural heterogeneity, and logical heterogeneity.]
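The evaluation parameters listed in this section (execution time aside) can be computed from the retrieved and expected mappings as in the following sketch. The function name `evaluate` and the representation of mappings as pairs of instance identifiers are assumptions; the set of non-expected mappings is taken here to be all candidate pairs that are not in the reference alignment, and the instance identifiers in the example are made up.

```python
def evaluate(retrieved, expected, candidate_pairs):
    """Compute precision, recall, F-measure and fall-out for a set of mappings."""
    retrieved, expected = set(retrieved), set(expected)
    correct = retrieved & expected
    incorrect = retrieved - expected
    # Assumption: non-expected mappings are the candidate pairs outside the reference alignment.
    non_expected = set(candidate_pairs) - expected
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    fall_out = len(incorrect) / len(non_expected) if non_expected else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fall_out": fall_out,
    }


# Two reference instances (a1, a2) and their modified counterparts (b1, b2).
candidates = [(a, b) for a in ("a1", "a2") for b in ("b1", "b2")]
expected = {("a1", "b1"), ("a2", "b2")}
retrieved = {("a1", "b1"), ("a1", "b2")}
print(evaluate(retrieved, expected, candidates))
```

In this small example one of the two retrieved mappings is correct, one of the two expected mappings is found, and one of the two non-expected pairs is wrongly retrieved, so all four measures are equal to 0.5.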

Finally, in Figure 8, the overall results obtained by executing the two algorithms (with reasoner) on our benchmark are reported. The test was executed on a Pentium 4 (2.00 GHz) with 512 MB of RAM. For each pair of compared instances, the first algorithm [10] analyzes all their property values, while the second algorithm [11] checks only the values specified for the "most identifying" properties. That is why the execution time of the first algorithm is greater than the execution time of the second one. Moreover, the recall of the second algorithm is higher than the recall of the first one, since the second algorithm ignores all the modifications applied to "non-identifying" properties. A more detailed description of the two algorithms is available in [10, 11].

[Figure 8: table with columns IM Algorithm, Precision, Recall, F-measure, Fall-out, Execution time; rows: algorithm 1, algorithm 2.]

6 Concluding Remarks

In this paper, we provided a benchmark for instance matching, taking into account the main requirements that instance matching algorithms should address. A contribution of our work is not only the definition of a specific benchmark, but also the definition of a semi-automatic procedure for the generation of several different benchmarks.

Future work includes the creation of further benchmarks dealing with data belonging to different sources and different domains. In particular, we would like to create a benchmark in which data belonging to different sources but referring to the same real-world objects are compared. For example, it can include a mapping between movie descriptions in IMDb and Amazon. In that case, the expected alignments have to be produced manually, so the size of the benchmark cannot be significant for a real benchmark. However, it would be interesting to compare the results obtained by the same algorithms on that benchmark and on our semi-automatically generated one, in order to evaluate the quality of our benchmark generation procedure itself.

Another possible development would be the definition of a set of rules that automatically choose the modifications to apply, for each modified ABox, to the reference ABox.

References

1. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag (2007)

2. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal on Data Semantics (JoDS) (2005)

3. Bouquet, P., Stoermer, H., Niederee, C., Mana, A.: Entity name system: The backbone of an open and scalable web of data. In: Proceedings of the IEEE International Conference on Semantic Computing, ICSC 2008, IEEE (2008)

4. Fellegi, I., Sunter, A.: A theory for record linkage. J. Am. Statistical Assoc. (1969)

5. Winkler, W.: The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC (1999)

6. Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: Current practice and future directions. Technical report, CSIRO Mathematical and Information Sciences, Canberra, Australia (2003)

7. Shvaiko, P., Euzenat, J., Noy, N., Stuckenschmidt, H., Benjamins, V., Uschold, M.: Proceedings of the 1st International Workshop on Ontology Matching (OM-2006), collocated with the 5th International Semantic Web Conference (ISWC-2006), Athens, Georgia, USA, November 5, 2006. In: OM. Volume 225, CEUR-WS.org (2006)

8. Hollink, L., van Assem, M., Wang, S., Isaac, A., Schreiber, G.: Two variations on ontology alignment evaluation: Methodological issues. ESWC 2008 (2008)

9. Isaac, A., Matthezing, H., van der Meij, L., Schlobach, S., Wang, S., Zinn, C.: Putting ontology alignment in context: Usage scenarios, deployment and evaluation in a library case. ESWC 2008 (2008)

10. Castano, S., Ferrara, A., Montanelli, S.: Matching ontologies in open networked systems: Techniques and applications. Journal on Data Semantics (JoDS) (2006)

11. Bruno, S., Castano, S., Ferrara, A., Lorusso, D., Messa, G., Montanelli, S.: Ontology coordination tools: Version 2. Technical Report D4.7, BOEMIE Project, FP6-027538, 6th EU Framework Programme (2007)