Towards Data Fusion in a Multi-ontology Environment

Andriy Nikolov, Victoria Uren, Enrico Motta
a.nikolov@open.ac.uk, v.s.uren@open.ac.uk, e.motta@open.ac.uk
Knowledge Media Institute, Open University, Milton Keynes, UK

ABSTRACT
With the growing amount of semantic data being published on the Web, the problem of finding individuals in different datasets which correspond to the same entity is gaining importance. Given that datasets are often structured using different ontologies, automatic schema-matching techniques have to be utilized before proceeding with data-level alignment. In this paper we discuss how ontology schema mismatches influence data-level alignment, based on our first experience with implementing a data fusion tool for a multi-ontology environment.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous; D.2 [Software]: Software Engineering

Keywords
Data fusion, coreference resolution, linked data

Copyright is held by the author/owner(s). LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION
The data integration process has to deal with two top-level problems: resolving schema-level and data-level issues. On the Web scale, semantic heterogeneity of data is inevitable, which makes it necessary for a data coreference resolution system to use the results of automatic ontology matching techniques. These techniques do not guarantee 100% accuracy, and the errors they produce may influence the quality of the data fusion stage. In our previous work we developed an architecture for semantic data fusion called KnoFuss [14]. The initial version of the system was designed for the enterprise knowledge management scenario, in which it was assumed that schema-level issues were resolved and the datasets being integrated were already structured according to the same ontology. We implemented an extension of the system, which utilizes automatically produced schema-level mappings to resolve coreferences between datasets using different ontologies. In this paper we discuss the impact of ontology heterogeneity on the quality of instance coreferencing.

2. ONTOLOGICAL MISMATCHES AND DATA INTEGRATION ISSUES
When the datasets to be integrated use different ontologies, it is hard for data integration methods to use the semantic data structure. Mappings between ontology terms are needed to provide a uniform view over individuals in two datasets and make the individuals comparable.

2.1 Ontological mismatches and correspondence patterns
Obtaining an adequate representation of mappings which allows correct data transformation is a non-trivial problem due to ontology mismatches. A classification framework of different types of mismatches between overlapping ontologies was given in [11]. Assuming that ontologies are represented in the same language, the framework distinguishes:

• Conceptualisation mismatches caused by different ways of domain interpretation. These different ways in turn may concern:
  – Scope, when two classes seemingly representing the same concept do not contain the same instances (e.g., the class PoliticalOrganization in the TAP ontology includes terrorist groups, while in SWETO it is meant to represent only legal organisations).
  – Model coverage and granularity, when parts of the domain in one ontology are not covered in another or are covered with a different level of detail (e.g., in SWETO the class Company does not have subclasses, while TAP and DBPedia 3.2 distinguish between different types of companies).

• Explication mismatches caused by different ways the conceptualisation is specified. These are further divided into:
  – Modelling style mismatches, when the same domain is modelled using different paradigms (e.g., point vs interval logic for time representation) or concept specification (e.g., splitting the subclasses of the same class in a hierarchy according to different criteria).
  – Terminological mismatches, when different terms are used to represent the same entity (synonymy) or the same term represents different entities (homonymy).
  – Encoding mismatches, when the values at the data level have different formats. This type has to be dealt with at the data-level stage, so we do not consider it in this paper.

To represent the correspondences between ontologies correctly and overcome these mismatches, mappings of varying degrees of complexity are required. In [16] common correspondence patterns are introduced to represent such mappings (see Fig. 1). For the most part mapping patterns represent description logic relations. Available automatic ontology matching algorithms can only produce a subset of possible mappings. Given the limited capabilities of ontology matching tools, we can expect that some of the ontology mismatches will remain unresolved or partially unresolved at the data integration stage. Below we consider the impact of such mismatches during the data integration process.

Figure 1: Correspondence patterns of ontology matching according to [16] (fragment). A commonly used DisjointClass pattern is included.

2.2 Data-level impact of ontology mismatches
The first type of mismatches in the classification presented in [11] concerns conceptualisation. For the coreference resolution stage, shared conceptualisation allows the system to:

• consider individuals belonging to the same class as candidates for matching;

• estimate the likelihood of individuals being equivalent given available evidence (e.g., having two people with the same name belonging to a specific class SemanticWebResearcher is much stronger evidence of equivalence than if they only had a generic class Person in common).

Conceptualisation mismatches between two ontologies (in particular, scope mismatches) may reduce both recall and precision of coreference resolution algorithms. For example, the class Company in SWETO does not include financial organisations, while its counterpart in TAP includes them. Thus, when the system tries to find coreferent individuals in SWETO for each company in TAP, having only the equivalence relation between these classes, it will not find matching pairs for financial organisations, because they belong to a different class in SWETO. This decreases recall. On the other hand, the class ComputerScientist in TAP contains only world-famous computer scientists, while most researchers are classified according to their place of work (e.g., CMUPerson, W3CPerson). ComputerScienceResearcher in SWETO, which automatic tools often consider equivalent, has much wider coverage and includes everybody who contributed to a CS paper mentioned in the knowledge base. Thus, labels in SWETO are much more ambiguous and the danger of matching two unrelated individuals increases, which may affect precision. The same happens when there is no equivalence between classes but a Sub-Super-Class relation: the same degree of similarity between individuals may provide much weaker evidence, which makes it hard to adequately estimate the reliability of methods’ output.

Another area of impact involves disjointness relations. Disjointness between classes can be used as evidence to consider some coreference mappings incorrect and delete them. Scope mismatches can lead to errors when classes considered disjoint in one ontology are overlapping in another one (as in the case of PoliticalOrganization and TerroristOrganization above): correct mappings can be deleted if they are perceived as causing inconsistency. Granularity mismatches do not allow using ontological constraints defined for classes at the lower levels of the hierarchy if the other ontology does not distinguish between these classes.

Among the explication mismatches, modelling style differences are the hardest to solve automatically. Translation between paradigms is a very domain-specific problem and common correspondence patterns are often not sufficient to align two ontologies. In a simple example case, if one ontology represents colours using a set of pre-defined labels (red, yellow, black) and another one uses RGB encoding, it is very hard to find similar values automatically: a hand-tailored matching procedure is necessary. To our knowledge, no existing automatic ontology matching tool is capable of dealing with different paradigms. For the case when subclasses of the same class in two ontologies are split according to different criteria, no useful DL relations can be established between them (apart from the fact that there may be some overlap). Such differences can make any automatic data integration procedures intractable. If these mismatches occur at lower levels of the hierarchy, methods can operate only with information defined at a higher level.

Finally, terminological mismatches are the primary focus of most existing ontology matching tools [5], which makes them the simplest to handle. They can be solved by creating EquivalentClass and EquivalentAttribute correspondences.

3. KNOFUSS ARCHITECTURE
The KnoFuss architecture [14] implements a modular framework for semantic data fusion. The fusion process is divided into subtasks as shown in Fig. 2, and the architecture focuses on its second stage: knowledge base integration. The first subtask is coreference resolution: finding potentially coreferent instances based on their attributes. The next stage, knowledge base updating, refines coreferencing results taking into account ontological constraints, data conflicts and links between individuals. Algorithms performing fusion subtasks (e.g., string-based similarity matchers) are represented as problem-solving methods. All methods for the same task have a common interface, and their capabilities (range of applicability and reliability of output) are formally defined using the fusion ontology. Because each algorithm behaves differently depending on the data to which it is applied, optimal parameters can be defined depending on the application context (type of data): e.g., Jaro-Winkler string similarity is appropriate for comparing person names but not suitable for publication titles.

Figure 2: Fusion task decomposition incorporating schema matching.

To deal with the multi-ontology scenario, the architecture has to cover the ontology integration stage, which includes two subtasks: ontology matching and instance transformation.

3.1 Ontology matching
The Ontology matching task involves creation of mapping rules or alignments: sets of correspondences between two ontologies [5].

Considering correspondence patterns, data fusion needs both correspondences between concepts (ClassCorrespondence) and correspondences between properties (AttributeCorrespondence). Class mappings allow relevant method application contexts to be translated into the terms of the source ontology, if they were initially defined in terms of the target ontology. Attribute correspondences are needed in order to retrieve properties relevant for coreference resolution in both knowledge bases. Equivalence and subsumption relations allow relevant data structures in the source ontology to be found. Disjointness relations between concepts are usable for the Knowledge base updating stage, providing evidence for inconsistency resolution. The architecture assumes that ontology matching methods provide their output in the standard Alignment API format [4].

3.2 Instance transformation
The goal of the Instance transformation stage is to resolve structural differences between two knowledge bases so that the architecture itself and instance-level methods can process individuals in the source and target knowledge bases in the same way. Alignments produced by ontology matching methods are applied to provide a uniform view over data in two knowledge bases. In the KnoFuss architecture SPARQL queries are used as a primary means of retrieving data (method applicability ranges, application contexts, sets of relevant attributes). These queries are translated into the terms of the source ontology using available mappings. Sometimes a term in the target ontology potentially corresponds to several terms in the source ontology. This happens when there are several candidate EquivalentClass mappings provided by one or several ontology matching tools. In such situations we combine these mappings and consider them as a single ClassUnion mapping. For instance, when we consider the query

    SELECT ?uri WHERE {
      ?uri rdf:type sweto:Computer_Science_Researcher }

the system tries to find all ClassCorrespondence mappings which include the class sweto:Computer_Science_Researcher. In our example with the CIDER tool (see below) these included EquivalentClass mappings with classes tap:CMUPerson, tap:ComputerScientist and tap:MedicalScientist. Such a variety of potentially corresponding classes is caused by several existing mismatches between ontologies, in particular terminological mismatches (Computer Science Researcher vs ComputerScientist), modelling style mismatches (tap:CMUPerson includes computer science researchers who worked at CMU) and conceptualisation scope mismatches (tap:ComputerScientist represents only a subset of “world-famous” researchers and tap:MedicalScientist includes authors of medical AI expert systems). From the strict logical point of view the only correct mapping would be a Sub-Super-Class mapping tap:ComputerScientist ⊆ sweto:Computer_Science_Researcher. However, excluding other mappings would remove from consideration many TAP individuals which have equivalent SWETO counterparts. In reality, the data integration system needs information about partial alignments between concepts to select individuals which may potentially be coreferent, rather than strict logical relations. We can call this the OverlapClass correspondence pattern. Thus, the query from our example is translated into:

    SELECT ?uri WHERE
    { {?uri rdf:type tap:CMUPerson}
      UNION {?uri rdf:type tap:ComputerScientist}
      UNION {?uri rdf:type tap:MedicalScientist}}

These pairs of queries, assumed to be equivalent, are then used at the later stages of the workflow, which allows the system to operate in the same way as in the single-ontology case.

At this stage the system utilizes the DisjointClass mappings. The system uses a simple algorithm to search for contradictory mappings: it finds situations when two classes in different ontologies are connected via a Sub-Super-Class mapping (created by ontology matching methods or inferred) and at the same time are disjoint (again, directly or via inference). Such mappings are considered conflicting. If the DisjointClass mapping has higher confidence, then the contradictory Sub-Super-Class mapping (or the mapping it was inferred from) is removed from consideration.

4. EXPERIMENTS
To test the KnoFuss architecture in a multi-ontology scenario we used two artificially created knowledge bases intended to be used as benchmarks for Semantic Web applications: TAP [9] and the SWETO testbed [1]. As primary methods for ontology matching we used two tools which participated in the last OAEI contest: CIDER [8] and Lily [18]. We also used the SCARLET service [15] as a method for generating DisjointClass mappings using existing ontologies defined elsewhere on the Web. Assuming that all sibling classes in the target ontology (SWETO) were mutually disjoint, and using equivalence mappings produced by the CIDER tool, we inferred additional disjointness mappings. Disjointness mappings were used to filter out conflicting equivalence relations with a low reliability. As coreference resolution methods for instances we used the same string similarity techniques as in our single-ontology scenario experiments [14]. While our experiments are still ongoing, from these tests we could make several observations.

First, as could be expected, errors during the schema matching stage are propagated and can potentially lead to significant distortions during instance coreferencing. For instance, when matching instances of the class sweto:Company the CIDER tool incorrectly aligned it with the class tap:Country. This led the coreference precision to drop to 41%, while it reached 74% without this mistake (many companies have names derived from country names). We found ontological constraints to be extremely valuable as a means to repair such errors. Apart from the widely used owl:FunctionalProperty and owl:InverseFunctionalProperty, which allow non-ambiguous instance identification, ontological axioms which may lead to inconsistency allow filtering out incorrect mappings. These constraints include disjointness and datatype properties with cardinality constraints. E.g., knowing that Company is disjoint with Country (or inferring that) would repair the problem. However, most ontologies do not define these constraints explicitly because they are not needed in common ontology usage scenarios.

Second, although semantic heterogeneity (different meaning attached to similar resources) is seen primarily as a schema-level knowledge modelling issue, it can cause problems at the instance level as well. For instance, the TAP ontology contains a single individual describing the Coca-Cola Company while SWETO contains several individuals describing Coca-Cola branches in different countries. Whether such instances should be considered coreferent depends on the context of the task.

Then, as for the single-ontology scenario, it is hard to find a single instance matching algorithm to apply to all kinds of data: settings have to be optimized for a specific type of data rather than for a specific pair of ontologies as in schema matching. Ontology mismatches may lead not just to irrelevant instances being compared, but also to instances being compared using inappropriate similarity measures.

5. DISCUSSION
As we said in the beginning, our primary interest when implementing the version of the KnoFuss architecture to be used in a multi-ontology scenario was to observe the influence of schema-level mismatches on the data integration stage.

In comparison with the single-ontology data fusion scenario, adding the ontology heterogeneity challenge results both in decreased reliability of methods’ output and difficulties in precise estimation of this decrease. First, for data-level coreference resolution methods we assume that the performance of the method depends on some common features of individuals belonging to a class: this assumption was the basis for the usage of application contexts in the KnoFuss architecture. For ontology matching methods, even knowing the estimated quality of a method (e.g., precision/recall in some test scenarios), it is hard to estimate whether it will hold for a different pair of datasets. Second, it is hard to measure precisely the impact of a single ontology-level error at the data level. This possible negative impact can result in:

• Erroneous widening or narrowing of the applicability range of integration methods (misaligned concepts).

• Providing noisy evidence for data-level methods (misaligned properties and ontological restrictions).

Finally, some ontological mismatches, such as modelling style, cannot be resolved fully automatically by currently existing tools and can make data-level methods inapplicable. Based on our experience, we can outline several directions for assisting data fusion in the presence of schema heterogeneity.

First, label comparison is usually not considered sufficiently reliable evidence for coreference resolution (e.g., [7]). However, more complex algorithms utilizing context data (additional properties and links between individuals) can only be applied to datasets containing sufficiently overlapping data. It can be expected that many data integration tasks on the Web scale will only be able to rely on instance names and thus can only provide suggestions rather than generate owl:sameAs statements carrying strong implications. Given that the output is likely to be noisy, it is necessary to keep track of data integration decisions (such as instance coreference mappings or statements considered incorrect) and their provenance. One possible way is to extend the coreference bundles approach [10] to include for each URI the confidence of its inclusion into the set.

Second, considering the limited capabilities of automatic ontology matching methods, availability of trusted reusable schema-level background knowledge is important. Such manually built reference knowledge is useful when it covers the gaps existing in common ontology matching scenarios. Among others, such reference knowledge may include:

• Specifying rich semantic restrictions existing in a certain domain, e.g., disjointness relations, property cardinality and domain/range constraints.

• Covering common ontological mismatches which cannot be resolved automatically. For instance, these can include transformation rules between common time modelling approaches and overlaps between subclasses of the same concept divided according to different criteria (e.g., classifying historical artifacts from China by centuries or by dynastic periods). In this way a complex modelling style mismatch can be reduced to a terminological one, which can be treated automatically.

Third, existing automatic matching tools sometimes impose overly rigid restrictions on their output in order to improve precision. For instance, some tools (like Lily) produce only one-to-one equivalence mappings, assuming that two different classes in one ontology cannot be considered equivalent to the same class in another ontology. Thus, only the best candidate for equivalence is selected and all others are filtered out. While this is a useful assumption for terminological mismatches, it may miss important mappings in the presence of conceptualisation and modelling style mismatches. From the data fusion point of view it would be useful if ontology matching algorithms could produce weak mapping relations such as OverlapClass.

6. RELATED WORK
Given the amount of data which needs to be handled on the Web scale, the need to use automatic coreference resolution techniques is recognized in the Semantic Web community [2], [7], [6]. Among the existing systems, Sindice [17] uses a straightforward method for coreference resolution by utilizing explicitly defined key properties (inverse functional properties). Individuals which have equal values for such properties are considered equivalent. This approach provides high precision but can only be applied to a limited subset of data, where such properties are defined explicitly and have values in a standard format. Other tools implement approximate matching techniques similar to those created in the database integration and ontology matching domains. The OKKAM server [3] used the Monge-Elkan string similarity metric for selecting coreferent instances in the experiments. RDF-AI [12] concentrates on data-level issues when combining datasets using the same schema. The algorithm uses string (Monge-Elkan) and linguistic (WordNet) similarity to calculate distance between literal property values and then uses an iterative graph matching algorithm, similar to similarity flooding [13], to calculate distance between individuals.

7. SUMMARY AND FUTURE WORK
We implemented the first prototype of the KnoFuss data integration system for the multi-ontology environment and performed initial experiments with it. In our view, combining automatic schema-level and data-level alignment techniques in a single workflow still presents difficulties, not only because schema-level matching tools occasionally produce errors, but also because some important types of ontology mismatches are not handled properly by them. In particular, this concerns conceptualisation and modelling style mismatches. While these mismatches are very hard to resolve automatically, there are several ways to assist the coreference resolution process when dealing with them, in particular:

• Extend the functionality of automatic schema-matching tools to discover different types of mappings such as DisjointClass and OverlapClass.

• Develop and publish reference ontologies explicitly defining common relations between concepts and properties which remain neglected in existing ontologies, including disjointness relations and translation rules between common modelling paradigms.

• Maintain provenance and estimated reliability of automatically produced instance-level mappings so that an agent can make a decision about whether to use them or not.

As the top priorities for future work we are currently considering the following:

• Continue experimental testing with public linked data sources using detailed ontologies (such as DBPedia 3.2).

• Develop a data fusion service which can operate on the Semantic Web in conjunction with existing linked data sources and semantic applications (such as WATSON, SCARLET, Alignment Server).

8. REFERENCES
[1] B. Aleman-Meza, C. Halaschek, A. Sheth, I. B. Arpinar, and G. Sannapareddy. SWETO: Large-scale Semantic Web test-bed. In Workshop on Ontology in Action, 16th International Conference on Software Engineering and Knowledge Engineering (SEKE2004), pages 21–24, 2004.
[2] P. Bouquet, H. Stoermer, and B. Bazzanella. An Entity Name System (ENS) for the Semantic Web. In 5th Annual European Semantic Web Conference (ESWC 2008), pages 258–272, 2008.
[3] P. Bouquet, H. Stoermer, and D. Giacomuzzi. OKKAM: Enabling a web of entities. In WWW2007 Workshop i3: Identity, Identifiers and Identification, Banff, Canada, 2007.
[4] J. Euzenat. An API for ontology alignment. In 3rd International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 698–712, Hiroshima, Japan, 2004. Springer.
[5] J. Euzenat and P. Shvaiko. Ontology matching. Springer-Verlag, Heidelberg, 2007.
[6] A. Ferrara, D. Lorusso, and S. Montanelli. Automatic identity recognition in the Semantic Web. In Workshop on Identity and Reference on the Semantic Web, ESWC 2008, Tenerife, Spain, 2008.
[7] H. Glaser, I. Millard, A. Jaffri, T. Lewy, and B. Dowling. On coreference and the Semantic Web. In 7th International Semantic Web Conference (ISWC 2008) (submitted), Karlsruhe, Germany, 2008.
[8] J. Gracia and E. Mena. Matching with CIDER: Evaluation report for the OAEI 2008. In 3rd Ontology Matching Workshop (OM’08) at the 7th International Semantic Web Conference (ISWC’08), Karlsruhe, Germany, 2008.
[9] R. V. Guha and R. McCool. TAP: a Semantic Web platform. Computer Networks, 42(5):557–577, 2003.
[10] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the Semantic Web. In Workshop on Identity and Reference on the Semantic Web (IRSW2008), Tenerife, Spain, 2008.
[11] M. Klein. Combining and relating ontologies: an analysis of problems and solutions. In Workshop on Ontologies and Information Sharing, 2001.
[12] Y. Liu, F. Scharffe, and C. Zhou. Towards practical RDF datasets fusion. In Workshop on Data Integration through Semantic Technology (DIST2008), ASWC 2008, Bangkok, Thailand, 2008.
[13] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm. In 18th International Conference on Data Engineering (ICDE), pages 117–128, San Jose (CA US), 2002.
[14] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Integration of semantically annotated data by the KnoFuss architecture. In 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2008), Acitrezza, Italy, 2008.
[15] M. Sabou, M. d’Aquin, and E. Motta. Exploring the Semantic Web as background knowledge for ontology matching. Journal of Data Semantics, 2008.
[16] F. Scharffe and D. Fensel. Correspondence patterns for ontology alignment. In 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2008), pages 83–92, Acitrezza, Italy, 2008.
[17] G. Tummarello, R. Delbru, and E. Oren. Sindice.com: Weaving the open linked data. In 6th International Semantic Web Conference (ISWC/ASWC 2007), pages 552–565, 2007.
[18] P. Wang and B. Xu. Lily: Ontology alignment results for OAEI 2008. In 3rd Ontology Matching Workshop (OM’08) at the 7th International Semantic Web Conference (ISWC’08), Karlsruhe, Germany, 2008.
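The query translation step of Section 3.2, where several candidate class mappings for one target class are combined into a single ClassUnion mapping, can be illustrated with a minimal Python sketch. It is not the KnoFuss implementation: the function name translate_type_query and the representation of mappings as plain (source class, target class) pairs are assumptions made for this illustration, whereas the architecture itself reads mappings in the Alignment API format [4].

```python
from typing import Iterable, Tuple


def translate_type_query(target_class: str,
                         mappings: Iterable[Tuple[str, str]]) -> str:
    """Rewrite `SELECT ?uri WHERE { ?uri rdf:type target_class }` in source terms.

    `mappings` holds (source_class, target_class) pairs (a hypothetical,
    simplified stand-in for the output of an ontology matching tool).
    All source classes mapped to `target_class` are combined with UNION.
    """
    # Collect every source class mapped to the requested target class.
    sources = sorted({src for src, tgt in mappings if tgt == target_class})
    if not sources:
        raise ValueError("no class mapping found for " + target_class)
    # One group graph pattern per candidate source class.
    patterns = ["{ ?uri rdf:type %s }" % src for src in sources]
    body = "\n  UNION ".join(patterns)
    return "SELECT ?uri WHERE {\n  " + body + "\n}"


if __name__ == "__main__":
    example_mappings = [
        ("tap:CMUPerson", "sweto:Computer_Science_Researcher"),
        ("tap:ComputerScientist", "sweto:Computer_Science_Researcher"),
        ("tap:MedicalScientist", "sweto:Computer_Science_Researcher"),
    ]
    print(translate_type_query("sweto:Computer_Science_Researcher",
                               example_mappings))
```

Treating every candidate as part of the union favours recall over strict logical correctness, which matches the OverlapClass reading discussed in Section 3.2.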
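The search for contradictory mappings described in Section 3.2 can be sketched in the same spirit. The sketch below handles only directly asserted mappings and leaves out the inference step mentioned in the text; the Mapping type, the relation labels and the confidence values are hypothetical and chosen only to mirror the Company/Country error reported in Section 4.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Mapping(NamedTuple):
    source: str        # class in the source ontology
    target: str        # class in the target ontology
    relation: str      # "equivalent", "subclass" or "disjoint" (assumed labels)
    confidence: float  # confidence reported by the matching method


def filter_conflicts(mappings: List[Mapping]) -> List[Mapping]:
    """Drop subsumption-like mappings contradicted by stronger disjointness."""
    # Index the best disjointness confidence for each unordered class pair.
    disjoint: Dict[FrozenSet[str], float] = {}
    for m in mappings:
        if m.relation == "disjoint":
            key = frozenset((m.source, m.target))
            disjoint[key] = max(disjoint.get(key, 0.0), m.confidence)

    kept = []
    for m in mappings:
        if m.relation in ("equivalent", "subclass"):
            d_conf = disjoint.get(frozenset((m.source, m.target)))
            # Remove the mapping only if disjointness is the stronger claim.
            if d_conf is not None and d_conf > m.confidence:
                continue
        kept.append(m)
    return kept


if __name__ == "__main__":
    candidates = [
        # Illustrative confidences, not measured values.
        Mapping("tap:Country", "sweto:Company", "equivalent", 0.4),
        Mapping("tap:Country", "sweto:Company", "disjoint", 0.9),
        Mapping("tap:Company", "sweto:Company", "equivalent", 0.8),
    ]
    for m in filter_conflicts(candidates):
        print(m)
```

Comparing confidences pairwise keeps the check simple; a fuller version would also propagate disjointness and subsumption through the class hierarchy before testing for conflicts.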