=Paper=
{{Paper
|id=Vol-538/paper-15
|storemode=property
|title=Towards Data Fusion in a Multi-ontology Environment
|pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper15.pdf
|volume=Vol-538
|dblpUrl=https://dblp.org/rec/conf/www/NikolovUM09
}}
==Towards Data Fusion in a Multi-ontology Environment==
Andriy Nikolov Victoria Uren Enrico Motta
a.nikolov@open.ac.uk v.s.uren@open.ac.uk e.motta@open.ac.uk
Knowledge Media Institute
Open University
Milton Keynes, UK
ABSTRACT

With the growing amount of semantic data being published on the Web the problem of finding individuals in different datasets which correspond to the same entity is gaining importance. Given that datasets are often structured using different ontologies, automatic schema-matching techniques have to be utilized before proceeding with data-level alignment. In this paper we discuss how ontology schema mismatches influence data-level alignment based on our first experience with implementing a data fusion tool for a multi-ontology environment.

Categories and Subject Descriptors

H.4.m [Information Systems]: Miscellaneous; D.2 [Software]: Software Engineering

Keywords

Data fusion, coreference resolution, linked data

Copyright is held by the author/owner(s). LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION

The data integration process has to deal with two top-level problems: resolving schema-level and data-level issues. On the Web scale, semantic heterogeneity of data is inevitable, which makes it necessary for a data coreference resolution system to use results of automatic ontology matching techniques. These techniques do not guarantee 100% accuracy and errors produced by them may influence the quality of the data fusion stage. In our previous work we developed an architecture for semantic data fusion called KnoFuss [14]. The initial version of the system was designed for the enterprise knowledge management scenario, in which it was assumed that schema-level issues were resolved and datasets being integrated were already structured according to the same ontology. We implemented an extension of the system, which utilizes schema-level mappings, produced automatically, to resolve coreferences between datasets using different ontologies. In this paper we discuss the impact of the ontology heterogeneity on the quality of instance coreferencing.

2. ONTOLOGICAL MISMATCHES AND DATA INTEGRATION ISSUES

The situation when datasets to be integrated use different ontologies makes it hard for data integration methods to use the semantic data structure. Mappings between ontology terms are needed to provide a uniform view over individuals in two datasets and make the individuals comparable.

2.1 Ontological mismatches and correspondence patterns

Obtaining an adequate representation of mappings which allows correct data transformation is a non-trivial problem due to ontology mismatches. A classification framework of different types of mismatches between overlapping ontologies was given in [11]. Assuming that ontologies are represented in the same language, the framework distinguishes:

• Conceptualisation mismatches caused by different ways of domain interpretation. These different ways in turn may concern:

– Scope, when two classes seemingly representing the same concept do not contain the same instances (e.g., the class PoliticalOrganization in TAP ontology includes terrorist groups, while in SWETO it is meant to represent only legal organisations).

– Model coverage and granularity, when parts of the domain in one ontology are not covered in another or covered with a different level of detail (e.g., in SWETO the class Company does not have subclasses while TAP and DBPedia 3.2 distinguish between different types of companies).

• Explication mismatches caused by different ways the conceptualisation is specified. These are further divided into:

– Modelling style mismatches, when the same domain is modeled using different paradigms (e.g., point vs interval logic for time representation) or concept specification (e.g., splitting the subclasses of the same class in a hierarchy according to different criteria).

– Terminological mismatches, when different terms are used to represent the same entity (synonymy) or the same term represents different entities (homonymy).

– Encoding mismatches, when the values at the data level have different formats. This one has to be dealt with at the data-level stage, so we do not consider it in this paper.
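
The classification above can also be written down as a small lookup structure, which may be convenient when annotating mappings with the kind of mismatch they are meant to address. The following Python sketch is only an illustration under stated assumptions: the names `Mismatch`, `CATEGORY` and `handled_at_schema_level` are hypothetical and are not taken from [11] or from KnoFuss.

```python
from enum import Enum


class Mismatch(Enum):
    """Mismatch types from the classification in [11], as listed above."""
    SCOPE = "scope"
    COVERAGE_GRANULARITY = "model coverage and granularity"
    MODELLING_STYLE = "modelling style"
    TERMINOLOGICAL = "terminological"
    ENCODING = "encoding"


# Top-level category of each mismatch type (hypothetical lookup table).
CATEGORY = {
    Mismatch.SCOPE: "conceptualisation",
    Mismatch.COVERAGE_GRANULARITY: "conceptualisation",
    Mismatch.MODELLING_STYLE: "explication",
    Mismatch.TERMINOLOGICAL: "explication",
    Mismatch.ENCODING: "explication",
}


def handled_at_schema_level(mismatch: Mismatch) -> bool:
    """Encoding mismatches are left to the data-level stage; all others are schema-level."""
    return mismatch is not Mismatch.ENCODING


# Mismatch types that remain in scope for the schema-level discussion.
schema_level = [m.value for m in Mismatch if handled_at_schema_level(m)]
print(schema_level)
```

The only design decision encoded here is the one stated in the text: encoding mismatches are excluded from the schema-level discussion, while the other four types remain in scope.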
Figure 1: Correspondence patterns of ontology matching according to [16] (fragment). A commonly used DisjointClass pattern is included.

To represent correctly the correspondences between ontologies and overcome these mismatches mappings of varying degrees of complexity are required. In [16] common correspondence patterns are introduced to represent such mappings (see Fig. 1). For the most part mapping patterns represent description logic relations. Available automatic ontology matching algorithms can only produce a subset of possible mappings. Given the limited capabilities of ontology matching tools we can expect that some of the ontology mismatches will remain unresolved or partially unresolved at the data integration stage. Below we try to consider the impact of such mismatches during the data integration process.

2.2 Data-level impact of ontology mismatches

The first type of mismatches in the classification presented in [11] concerns conceptualisation. For the coreference resolution stage shared conceptualisation allows the system to:

• consider individuals belonging to the same class as candidates for matching;

• estimate the likelihood of individuals being equivalent given available evidence (e.g., having two people with the same name belonging to a specific class SemanticWebResearcher is a much stronger evidence of equivalence than if they only had a generic class Person in common).

Conceptualisation mismatches between two ontologies (in particular, scope mismatches) may reduce both recall and precision of coreference resolution algorithms. For example, the class Company in SWETO does not include financial organisations, while its counterpart in TAP includes them. Thus, when the system tries to find for each company in TAP coreferent individuals in SWETO only having the equivalence relation between these classes, it will not find matching pairs for financial organisations, because they belong to a different class in SWETO. This will make the recall decrease. On the other hand, the class ComputerScientist in TAP contains only world-famous computer scientists while most researchers are classified according to their place of work (e.g., CMUPerson, W3CPerson). ComputerScienceResearcher in SWETO, which automatic tools often consider equivalent, has much wider coverage and includes everybody who contributed to a CS paper mentioned in the knowledge base. Thus, labels in SWETO are much more ambiguous and the danger of matching two unrelated individuals increases, which may affect precision. The same happens when there is no equivalence between classes but a Sub-Super-Class relation: the same degree of similarity between individuals may provide much weaker evidence, which makes it hard to adequately estimate the reliability of methods’ output. Another area of impact involves disjointness relations. Disjointness between classes can be used as evidence to consider some coreference mappings incorrect and delete them. Scope mismatches can lead to errors when classes considered disjoint in one ontology are overlapping in another one (like in the case with PoliticalOrganization and TerroristOrganization above): correct mappings can be deleted if they are perceived as causing inconsistency. Granularity mismatches do not allow using ontological constraints defined for classes at the lower levels of the hierarchy if the other ontology does not distinguish between these classes.

Among the explication mismatches modelling style differences are the hardest to solve automatically. Translation between paradigms is a very domain-specific problem and common correspondence patterns are often not sufficient to align two ontologies. In a simple example case, if one ontology represents colours using a set of pre-defined labels (red, yellow, black) and another one uses RGB encoding, it is very hard to find similar values automatically: a hand-tailored matching procedure is necessary. To our knowledge, no existing automatic ontology matching tool is capable of dealing with different paradigms. For the case when subclasses of the same class in two ontologies are split according to different criteria, no useful DL relations can be established between them (apart from the fact that there may be some overlap). Such differences can make any automatic data integration procedures intractable. If these mismatches occur at lower levels of the hierarchy, methods can operate only with information defined at a higher level.

Finally, terminological mismatches are the primary focus of most existing ontology matching tools [5], which makes them the simplest to handle. They can be solved by creating EquivalentClass and EquivalentAttribute correspondences.

3. KNOFUSS ARCHITECTURE

The KnoFuss architecture [14] implements a modular framework for semantic data fusion. The fusion process is divided into subtasks as shown in the Fig. 2 and the architecture focuses on its second stage: knowledge base integration.

Figure 2: Fusion task decomposition incorporating schema matching.
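
Before the individual subtasks are described, the recall effect of the scope mismatch from Section 2.2 (Company in TAP versus Company in SWETO) can be made concrete with a toy example. The Python sketch below is an illustration under stated assumptions, not part of KnoFuss: the individuals, the SWETO class name `Bank`, and the helper names `candidate_pairs` and `recall` are all hypothetical, and exact label equality stands in for a real similarity measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    label: str
    cls: str  # class of the individual in its own ontology


# Hypothetical toy data: the labels and the SWETO class name "Bank" are invented.
tap = [Individual("Acme Software", "Company"),
       Individual("Example Bank", "Company")]
sweto = [Individual("Acme Software", "Company"),
         Individual("Example Bank", "Bank")]

# Gold standard for this toy setting: equal labels denote the same entity.
gold = {(s, t) for s in tap for t in sweto if s.label == t.label}


def candidate_pairs(source, target, class_mappings):
    """Return label-equal pairs whose classes are linked by a (source, target) class mapping."""
    return {(s, t) for s in source for t in target
            if (s.cls, t.cls) in class_mappings and s.label == t.label}


def recall(found, gold_pairs):
    """Fraction of gold pairs that were actually found."""
    return len(found & gold_pairs) / len(gold_pairs)


# Only the equivalence mapping between the two Company classes is available.
equivalence_only = {("Company", "Company")}
# An additional (hypothetical) mapping that acknowledges the partial overlap.
with_overlap = equivalence_only | {("Company", "Bank")}

print(recall(candidate_pairs(tap, sweto, equivalence_only), gold))  # 0.5
print(recall(candidate_pairs(tap, sweto, with_overlap), gold))      # 1.0
```

With only the equivalence mapping, the financial organisation is never compared with its counterpart, so half of the coreferent pairs in this toy setting are missed; adding a mapping to the class that actually holds such organisations restores them.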
tially coreferent instances based on their attributes. The cluded EquivalentClass mappings with classes tap: CMU-
next stage, knowledge base updating, refines coreferencing Person, tap:ComputerScientist and tap:MedicalScientist.
results taking into account ontological constraints, data con- Such a variety of potentially corresponding classes is caused
flicts and links between individuals. Algorithms performing by several existing mismatches between ontologies, in par-
fusion subtasks (e.g., string-based similarity matchers) are ticular terminological mismatches (Computer Science Re-
represented as problem-solving methods. All methods for searcher vs ComputerScientist), modelling style mismatches
the same task have a common interface and their capabil- (tap: CMUPerson includes computer science researchers who
ities (range of applicability and reliability of output) are worked in the CMU) and conceptualisation scope mismatches
formally defined using the fusion ontology. Because each al- (tap: ComputerScientist represents only a subset of “world-
gorithm behaves differently depending on the data to which famous” researchers and tap:Medical-Scientist includes au-
it is applied, optimal parameters can be defined depending thors of medical AI expert systems). From the strict logical
on the application context (type of data): e.g., Jaro-Winkler point of view the only correct mapping would be a Sub-
string similarity is appropriate for comparing person names Super-Class mapping tap:ComputerScientist ⊆ sweto: Com-
but not suitable for publication titles, etc. puter Science Researcher. However, excluding other map-
To deal with the multi-ontology scenario the architecture pings would remove from consideration many TAP individ-
has to cover the ontology integration stage, which includes uals, which have their equivalent SWETO counterparts. In
two subtasks: ontology matching and instance transforma- reality, the data integration system needs information about
tion. partial alignments between concepts to select individuals
which may potentially be coreferent rather than strict logical
3.1 Ontology matching relations. We can call this the OverlapClass correspondence
The Ontology matching task involves creation of mapping pattern. Thus, the query from our example is translated
rules or alignments: sets of correspondences between two into:
ontologies [5]. SELECT ?uri WHERE
Considering correspondence patterns, data fusion needs { {?uri rdf:type tap:CMUPerson}
both correspondences between concepts (ClassCorrespon- UNION {?uri rdf:type tap:Computer Scientist}
dence) and correspondences between properties (Attribute- UNION {?uri rdf:type tap:Medical Scientist}}
Correspondence). Class mappings allow relevant method These pairs of queries assumed to be equivalent are then
application contexts to be translated into the terms of the used at the later stages of the workflow, which allows the
source ontology, if they were initially defined in terms of the system to operate in the same way as in a single ontology
target ontology. Attribute correspondences are needed in case. At this stage the system utilizes the DisjointClass
order to retrieve properties relevant for coreference resolu- mappings. The system uses a simple algorithm to search
tion in both knowledge bases. Equivalence and subsumption for contradictory mappings: it finds situations when two
relations allow relevant data structures in the source ontol- classes in different ontologies are connected via a Sub-Super-
ogy to be found. Disjointness relations between concepts Class mapping (created by ontology matching methods or
are usable for the Knowledge base updating stage, providing inferred) and at the same time are disjoint (again, directly
evidence for inconsistency resolution. The architecture as- or via inference). Such mappings are considered conflicting.
sumes that ontology matching methods provide their output If the DisjointClass mapping has higher confidence then the
in the standard Alignment API format [4]. contradictory Sub-Super-Class mapping (or the mapping it
was inferred from) is removed from consideration.
3.2 Instance transformation
The goal of the Instance transformation stage is to resolve 4. EXPERIMENTS
structural differences between two knowledge bases so that To test the KnoFuss architecture in a multi-ontology sce-
the architecture itself and instance-level methods can pro- nario we used two artificially created knowledge bases in-
cess individuals in the source and target knowledge bases in tended to be used as benchmarks for Semantic Web ap-
the same way. Alignments produced by ontology match- plications: TAP [9] and SWETO testbed [1]. As primary
ing methods are applied to provide a uniform view over methods for ontology matching we used two tools, which
data in two knowledge bases. In the KnoFuss architecture participated in the last OAEI contest: CIDER [8] and Lily
SPARQL queries are used as a primary means of retriev- [18]. Also we used the SCARLET service [15] as a method
ing data (method applicability ranges, application contexts, for generating DisjointClass mappings using existing ontolo-
sets of relevant attributes). These queries are translated into gies defined elsewhere on the Web. Assuming that all sib-
the terms of the source ontology using available mappings. ling classes in the target ontology (SWETO) were mutually
Sometimes a term in the target ontology potentially corre- disjoint and using equivalence mappings produced by the
sponds to several terms in the source ontology. This happens CIDER tool we inferred additional disjointness mappings.
when there are several candidate EquivalentClass mappings Disjointness mappings were used to filter out conflicting
provided by one or several ontology matching tools. In such equivalence relations with a low reliability. As coreference
situations we combine these mappings and consider them as resolution methods for instances we used the same string
a single ClassUnion mapping. For instance when we con- similarity techniques as in our single-ontology scenario ex-
sider the query periments [14]. While our experiments are still ongoing,
SELECT ?uri WHERE { from these tests we could make several observations.
?uri rdf:type sweto:Computer Science Researcher } First, as could be expected, errors during schema match-
the system tries to find all ClassCorrespondence mappings, ing stage are propagated and can potentially lead to signifi-
which include the class sweto:Computer Science Researcher. cant distortions during instance coreferencing. For instance,
In our example with the CIDER tool (see below) these in- when matching instances of the class sweto:Company the
CIDER tool incorrectly aligned it with the class tap:Country. This led the coreference precision to drop to 41% while it reached 74% without this mistake (many companies have names derived from country names). We found ontological constraints to be extremely valuable as a means to repair such errors. Apart from the widely used owl:FunctionalProperty and owl:InverseFunctionalProperty, which allow non-ambiguous instance identification, ontological axioms, which may lead to inconsistency, allow filtering out incorrect mappings. These constraints include disjointness and datatype properties with cardinality constraints. E.g., knowing that Company is disjoint with Country (or inferring that) would repair the problem. However, most ontologies do not define these constraints explicitly because they are not needed in common ontology usage scenarios.

Second, although semantic heterogeneity (different meaning attached to similar resources) is seen primarily as a schema-level knowledge modelling issue, it can cause problems at the instance level as well. For instance, the TAP ontology contains a single individual describing the Coca-Cola Company while SWETO contains several individuals describing Coca-Cola branches in different countries. Whether such instances should be considered coreferent depends on the context of the task.

Then, as for the single-ontology scenario, it is hard to find a single instance matching algorithm to apply to all kinds of data: settings have to be optimized for a specific type of data rather than for a specific pair of ontologies as in schema matching. Ontology mismatches may lead not just to irrelevant instances being compared, but also to instances being compared using inappropriate similarity measures.

5. DISCUSSION

As we said in the beginning, our primary interest when implementing the version of the KnoFuss architecture to be used in a multi-ontology scenario was to observe the influence of schema-level mismatches on the data integration stage.

In comparison with the single-ontology data fusion scenario, adding the ontology heterogeneity challenge results both in decreased reliability of methods’ output and difficulties in precise estimation of this decrease. For data-level coreference resolution methods we assume that the performance of the method depends on some common features of individuals belonging to a class: this assumption was the basis for the usage of application contexts in the KnoFuss architecture. For ontology matching methods even knowing the estimated quality of a method (e.g., precision/recall in some test scenarios) it is hard to estimate whether it will hold for a different pair of datasets. Second, it is hard to measure precisely the impact of a single ontology-level error at the data level. This possible negative impact can result in:

• Erroneous widening or narrowing of the applicability range of integration methods (misaligned concepts).

• Providing noisy evidence for data-level methods (misaligned properties and ontological restrictions).

Finally, some ontological mismatches, such as modelling style, cannot be resolved fully automatically by currently existing tools and can make data-level methods inapplicable. Based on our experience, we can outline several directions for assisting data fusion in the presence of schema heterogeneity.

First, label comparison is usually not considered sufficiently reliable evidence for coreference resolution (e.g., [7]). However, more complex algorithms utilizing context data (additional properties and links between individuals) can only be applied to datasets containing sufficiently overlapping data. It can be expected that many data integration tasks on the Web scale will only be able to rely on instance names and thus can only provide suggestions rather than generate owl:sameAs statements carrying strong implications. Given that the output is likely to be noisy it is necessary to keep track of data integration decisions (such as instance coreference mappings or statements considered incorrect) and their provenance. One possible way is to extend the coreference bundles approach [10] to include for each URI the confidence of its inclusion into the set.

Second, considering the limited capabilities of automatic ontology matching methods, availability of trusted reusable schema-level background knowledge is important. Such manually built reference knowledge is useful when it covers the gaps existing in common ontology matching scenarios. Among others, such reference knowledge may include:

• Specifying rich semantic restrictions existing in a certain domain, e.g., disjointness relations, property cardinality and domain/range constraints.

• Covering common ontological mismatches, which cannot be resolved automatically. For instance, these can include transformation rules between common time modelling approaches and overlaps between subclasses of the same concept divided according to different criteria (e.g., classifying historical artifacts from China by centuries or by dynastic periods). In this way a complex modelling style mismatch can be reduced to a terminological one, which can be treated automatically.

Third, sometimes existing automatic matching tools impose too rigid restrictions on their output aimed at improving the precision. For instance, some tools (like Lily) produce only one-to-one equivalence mappings assuming that two different classes in one ontology cannot be considered equivalent to the same class in another ontology. Thus, only the best candidate for equivalence is selected and all others are filtered out. While a useful assumption for terminological mismatches, it may miss important mappings in the presence of conceptualisation and modelling style mismatches. From the data fusion point of view it would be useful if ontology matching algorithms could produce weak mapping relations such as ClassOverlap.

6. RELATED WORK

Given the amount of data, which needs to be handled on the Web scale, the need to use automatic coreference resolution techniques is recognized in the Semantic Web community [2], [7], [6]. Among the existing systems Sindice [17] uses a straightforward method for coreference resolution by utilizing explicitly defined key properties (inverse functional properties). Individuals, which have equal values for such properties are considered equivalent. This is an approach which provides high precision but can only
be applied to a limited subset of data, where such properties are defined explicitly and have values in a standard format. Other tools implement approximate matching techniques similar to those created in the database integration and ontology matching domains. The OKKAM server [3] used the Monge-Elkan string similarity metrics for selecting coreferent instances in the experiments. RDF-AI [12] concentrates on data-level issues when combining datasets using the same schema. The algorithm uses string (Monge-Elkan) and linguistic (WordNet) similarity to calculate distance between literal property values and then uses the iterative graph matching algorithm, similar to similarity flooding [13], to calculate distance between individuals.

7. SUMMARY AND FUTURE WORK

We implemented the first prototype of the KnoFuss data integration system for the multi-ontology environment and performed initial experiments with it. In our view, combining automatic schema-level and data-level alignment techniques in a single workflow still presents difficulties not only because schema-level matching tools occasionally produce errors, but also because some important types of ontology mismatches are not handled properly by them. In particular, this concerns conceptualisation and modelling style mismatches. While being very hard to solve automatically, there are several ways to assist the coreference resolution process when dealing with these mismatches, in particular:

• Extend the functionality of automatic schema-matching tools to discover different types of mappings such as DisjointClass and OverlapClass.

• Develop and publish reference ontologies explicitly defining common relations between concepts and properties, which remain neglected in existing ontologies, including disjointness relations and translation rules between common modelling paradigms.

• Maintain provenance and estimated reliability of automatically produced instance-level mappings so that an agent can make a decision about whether to use them or not.

As the top priorities for the future work currently we are considering the following:

• Continue more experimental testing with public linked data sources using detailed ontologies (such as DBPedia 3.2).

• Develop a data fusion service, which can operate on the Semantic Web in conjunction with existing linked data sources and semantic applications (such as WATSON, SCARLET, Alignment Server).

8. REFERENCES

[1] B. Aleman-Meza, C. Halaschek, A. Sheth, I. B. Arpinar, and G. Sannapareddy. SWETO: Large-scale Semantic Web test-bed. In Workshop on Ontology in Action, 16th International Conference on Software Engineering and Knowledge Engineering (SEKE2004), pages 21–24, 2004.

[2] P. Bouquet, H. Stoermer, and B. Bazzanella. An Entity Name System (ENS) for the Semantic Web. In 5th Annual European Semantic Web Conference (ESWC 2008), pages 258–272, 2008.

[3] P. Bouquet, H. Stoermer, and D. Giacomuzzi. OKKAM: Enabling a web of entities. In WWW2007 Workshop i3: Identity, Identifiers and Identification, Banff, Canada, 2007.

[4] J. Euzenat. An API for ontology alignment. In 3rd International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 698–712, Hiroshima, Japan, 2004. Springer.

[5] J. Euzenat and P. Shvaiko. Ontology matching. Springer-Verlag, Heidelberg, 2007.

[6] A. Ferrara, D. Lorusso, and S. Montanelli. Automatic identity recognition in the Semantic Web. In Workshop on Identity and Reference on the Semantic Web, ESWC 2008, Tenerife, Spain, 2008.

[7] H. Glaser, I. Millard, A. Jaffri, T. Lewy, and B. Dowling. On coreference and the Semantic Web. In 7th International Semantic Web Conference (ISWC 2008) (submitted), Karlsruhe, Germany, 2008.

[8] J. Gracia and E. Mena. Matching with CIDER: Evaluation report for the OAEI 2008. In 3rd Ontology Matching Workshop (OM’08) at the 7th International Semantic Web Conference (ISWC’08), Karlsruhe, Germany, 2008.

[9] R. V. Guha and R. McCool. TAP: a Semantic Web platform. Computer Networks, 42(5):557–577, 2003.

[10] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the Semantic Web. In Workshop on Identity and Reference on the Semantic Web (IRSW2008), Tenerife, Spain, 2008.

[11] M. Klein. Combining and relating ontologies: an analysis of problems and solutions. In Workshop on Ontologies and Information Sharing, 2001.

[12] Y. Liu, F. Scharffe, and C. Zhou. Towards practical rdf datasets fusion. In Workshop on Data Integration through Semantic Technology (DIST2008), ASWC 2008, Bangkok, Thailand, 2008.

[13] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm. In 18th International Conference on Data Engineering (ICDE), pages 117–128, San Jose (CA US), 2002.

[14] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Integration of semantically annotated data by the KnoFuss architecture. In 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2008), Acitrezza, Italy, 2008.

[15] M. Sabou, M. d’Aquin, and E. Motta. Exploring the Semantic Web as background knowledge for ontology matching. Journal of Data Semantics, 2008.

[16] F. Scharffe and D. Fensel. Correspondence patterns for ontology alignment. In 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2008), pages 83–92, Acitrezza, Italy, 2008.

[17] G. Tummarello, R. Delbru, and E. Oren. Sindice.com: Weaving the open linked data. In 6th International Semantic Web Conference (ISWC/ASWC 2007), pages 552–565, 2007.

[18] P. Wang and B. Xu. Lily: Ontology alignment results for OAEI 2008. In 3rd Ontology Matching Workshop (OM’08) at the 7th International Semantic Web Conference (ISWC’08), Karlsruhe, Germany, 2008.