Towards Data Fusion in a Multi-ontology Environment

                Andriy Nikolov                          Victoria Uren                   Enrico Motta
            a.nikolov@open.ac.uk                   v.s.uren@open.ac.uk               e.motta@open.ac.uk
                                                   Knowledge Media Institute
                                                       Open University
                                                      Milton Keynes, UK


Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

ABSTRACT
With the growing amount of semantic data being published on the Web, the
problem of finding individuals in different datasets which correspond to
the same entity is gaining importance. Given that datasets are often
structured using different ontologies, automatic schema-matching
techniques have to be utilized before proceeding with data-level
alignment. In this paper we discuss how ontology schema mismatches
influence data-level alignment based on our first experience with
implementing a data fusion tool for a multi-ontology environment.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous;
D.2 [Software]: Software Engineering

Keywords
Data fusion, coreference resolution, linked data

1.   INTRODUCTION
   The data integration process has to deal with two top-level problems:
resolving schema-level and data-level issues. On the Web scale, semantic
heterogeneity of data is inevitable, which makes it necessary for a data
coreference resolution system to use results of automatic ontology
matching techniques. These techniques do not guarantee 100% accuracy and
errors produced by them may influence the quality of the data fusion
stage. In our previous work we developed an architecture for semantic
data fusion called KnoFuss [14]. The initial version of the system was
designed for the enterprise knowledge management scenario, in which it
was assumed that schema-level issues were resolved and datasets being
integrated were already structured according to the same ontology. We
implemented an extension of the system, which utilizes schema-level
mappings, produced automatically, to resolve coreferences between
datasets using different ontologies. In this paper we discuss the impact
of the ontology heterogeneity on the quality of instance coreferencing.

2.   ONTOLOGICAL MISMATCHES AND DATA INTEGRATION ISSUES
   The situation when datasets to be integrated use different ontologies
makes it hard for data integration methods to use the semantic data
structure. Mappings between ontology terms are needed to provide a
uniform view over individuals in two datasets and make the individuals
comparable.

2.1   Ontological mismatches and correspondence patterns
   Obtaining an adequate representation of mappings which allows correct
data transformation is a non-trivial problem due to ontology mismatches.
A classification framework of different types of mismatches between
overlapping ontologies was given in [11]. Assuming that ontologies are
represented in the same language, the framework distinguishes:

   • Conceptualisation mismatches caused by different ways of domain
     interpretation. These different ways in turn may concern:

        – Scope, when two classes seemingly representing the same concept
          do not contain the same instances (e.g., the class
          PoliticalOrganization in TAP ontology includes terrorist
          groups, while in SWETO it is meant to represent only legal
          organisations).

        – Model coverage and granularity, when parts of the domain in one
          ontology are not covered in another or covered with a different
          level of detail (e.g., in SWETO the class Company does not have
          subclasses while TAP and DBPedia 3.2 distinguish between
          different types of companies).

   • Explication mismatches caused by different ways the
     conceptualisation is specified. These are further divided into:

        – Modelling style mismatches, when the same domain is modeled
          using different paradigms (e.g., point vs interval logic for
          time representation) or concept specification (e.g., splitting
          the subclasses of the same class in a hierarchy according to
          different criteria).

        – Terminological mismatches, when different terms are used to
          represent the same entity (synonymy) or the same term
          represents different entities (homonymy).

        – Encoding mismatches, when the values at the data level have
          different formats. This one has to be dealt with at the
          data-level stage, so we do not consider it in this paper.
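The two-level classification above can be written down as a small data structure. The following is only an illustrative sketch: the names `Mismatch`, `by_group` and `deferred_to_data_level` are hypothetical and do not come from [11] or from KnoFuss. It records each leaf category with its parent group and marks the one category that the text defers to the data-level stage.

```python
from enum import Enum


class Mismatch(Enum):
    """Leaf categories of the classification, stored as (group, label)."""
    SCOPE = ("conceptualisation", "scope")
    COVERAGE_GRANULARITY = ("conceptualisation", "model coverage and granularity")
    MODELLING_STYLE = ("explication", "modelling style")
    TERMINOLOGICAL = ("explication", "terminological")
    ENCODING = ("explication", "encoding")

    @property
    def group(self) -> str:
        # Top-level group: "conceptualisation" or "explication".
        return self.value[0]


def by_group(group: str) -> list:
    """Return the leaf categories of one top-level group, in declaration order."""
    return [m for m in Mismatch if m.group == group]


def deferred_to_data_level(m: Mismatch) -> bool:
    """Per the text, only encoding mismatches are left to the data-level stage."""
    return m is Mismatch.ENCODING


# Categories that remain in scope for the schema-level discussion.
schema_level = [m for m in Mismatch if not deferred_to_data_level(m)]
```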
Figure 1: Correspondence patterns of ontology matching according to [16]
(fragment). A commonly used DisjointClass pattern is included.

   To represent correctly the correspondences between ontologies and
overcome these mismatches, mappings of varying degrees of complexity are
required. In [16] common correspondence patterns are introduced to
represent such mappings (see Fig. 1). For the most part mapping patterns
represent description logic relations. Available automatic ontology
matching algorithms can only produce a subset of possible mappings. Given
the limited capabilities of ontology matching tools we can expect that
some of the ontology mismatches will remain unresolved or partially
unresolved at the data integration stage. Below we try to consider the
impact of such mismatches during the data integration process.

2.2   Data-level impact of ontology mismatches
   The first type of mismatches in the classification presented in [11]
concerns conceptualisation. For the coreference resolution stage shared
conceptualisation allows the system to:

   • consider individuals belonging to the same class as candidates for
     matching;

   • estimate the likelihood of individuals being equivalent given
     available evidence (e.g., having two people with the same name
     belonging to a specific class SemanticWebResearcher is much stronger
     evidence of equivalence than if they only had a generic class Person
     in common).

Conceptualisation mismatches between two ontologies (in particular, scope
mismatches) may reduce both recall and precision of coreference
resolution algorithms. For example, the class Company in SWETO does not
include financial organisations, while its counterpart in TAP includes
them. Thus, when the system tries to find for each company in TAP
coreferent individuals in SWETO only having the equivalence relation
between these classes, it will not find matching pairs for financial
organisations, because they belong to a different class in SWETO. This
will make the recall decrease. On the other hand, the class
ComputerScientist in TAP contains only world-famous computer scientists
while most researchers are classified according to their place of work
(e.g., CMUPerson, W3CPerson). ComputerScienceResearcher in SWETO, which
automatic tools often consider equivalent, has much wider coverage and
includes everybody who contributed to a CS paper mentioned in the
knowledge base. Thus, labels in SWETO are much more ambiguous and the
danger of matching two unrelated individuals increases, which may affect
precision. The same happens when there is no equivalence between classes
but a Sub-Super-Class relation: the same degree of similarity between
individuals may provide much weaker evidence, which makes it hard to
adequately estimate the reliability of methods’ output. Another area of
impact involves disjointness relations. Disjointness between classes can
be used as evidence to consider some coreference mappings incorrect and
delete them. Scope mismatches can lead to errors when classes considered
disjoint in one ontology are overlapping in another one (as in the case
with PoliticalOrganization and TerroristOrganization above): correct
mappings can be deleted if they are perceived as causing inconsistency.
Granularity mismatches do not allow using ontological constraints defined
for classes at the lower levels of the hierarchy if the other ontology
does not distinguish between these classes.
   Among the explication mismatches modelling style differences are the
hardest to solve automatically. Translation between paradigms is a very
domain-specific problem and common correspondence patterns are often not
sufficient to align two ontologies. In a simple example case, if one
ontology represents colours using a set of pre-defined labels (red,
yellow, black) and another one uses RGB encoding, it is very hard to find
similar values automatically: a hand-tailored matching procedure is
necessary. To our knowledge, no existing automatic ontology matching tool
is capable of dealing with different paradigms. For the case when
subclasses of the same class in two ontologies are split according to
different criteria, no useful DL relations can be established between
them (apart from the fact that there may be some overlap). Such
differences can make any automatic data integration procedures
intractable. If these mismatches occur at lower levels of the hierarchy,
methods can operate only with information defined at a higher level.
   Finally, terminological mismatches are the primary focus of most
existing ontology matching tools [5], which makes them the simplest to
handle. They can be solved by creating EquivalentClass and
EquivalentAttribute correspondences.

Figure 2: Fusion task decomposition incorporating schema matching.

3.   KNOFUSS ARCHITECTURE
   The KnoFuss architecture [14] implements a modular framework for
semantic data fusion. The fusion process is divided into subtasks as
shown in Fig. 2 and the architecture focuses on its second stage:
knowledge base integration. The first subtask is coreference resolution:
finding poten-
tially coreferent instances based on their attributes. The        cluded EquivalentClass mappings with classes
next stage, knowledge base updating, refines coreferencing        tap:CMUPerson, tap:ComputerScientist and tap:MedicalScientist.
results taking into account ontological constraints, data con-    Such a variety of potentially corresponding classes is caused
flicts and links between individuals. Algorithms performing       by several existing mismatches between ontologies, in par-
fusion subtasks (e.g., string-based similarity matchers) are      ticular terminological mismatches (Computer Science Re-
represented as problem-solving methods. All methods for           searcher vs ComputerScientist), modelling style mismatches
the same task have a common interface and their capabil-          (tap:CMUPerson includes computer science researchers who
ities (range of applicability and reliability of output) are      worked at CMU) and conceptualisation scope mismatches
formally defined using the fusion ontology. Because each al-      (tap:ComputerScientist represents only a subset of “world-
gorithm behaves differently depending on the data to which        famous” researchers and tap:MedicalScientist includes au-
it is applied, optimal parameters can be defined depending        thors of medical AI expert systems). From the strict logical
on the application context (type of data): e.g., Jaro-Winkler     point of view the only correct mapping would be a Sub-
string similarity is appropriate for comparing person names       Super-Class mapping tap:ComputerScientist ⊆
but not suitable for publication titles, etc.                     sweto:Computer_Science_Researcher. However, excluding other map-
   To deal with the multi-ontology scenario the architecture      pings would remove from consideration many TAP individ-
has to cover the ontology integration stage, which includes       uals, which have their equivalent SWETO counterparts. In
two subtasks: ontology matching and instance transforma-          reality, the data integration system needs information about
tion.                                                             partial alignments between concepts to select individuals
                                                                  which may potentially be coreferent rather than strict logical
3.1    Ontology matching                                          relations. We can call this the OverlapClass correspondence
   The Ontology matching task involves creation of mapping        pattern. Thus, the query from our example is translated
rules or alignments: sets of correspondences between two          into:
ontologies [5].                                                   SELECT ?uri WHERE
   Considering correspondence patterns, data fusion needs            { {?uri rdf:type tap:CMUPerson}
both correspondences between concepts (ClassCorrespon-               UNION {?uri rdf:type tap:ComputerScientist}
dence) and correspondences between properties (Attribute-            UNION {?uri rdf:type tap:MedicalScientist}}
Correspondence). Class mappings allow relevant method             These pairs of queries assumed to be equivalent are then
application contexts to be translated into the terms of the       used at the later stages of the workflow, which allows the
source ontology, if they were initially defined in terms of the   system to operate in the same way as in a single ontology
target ontology. Attribute correspondences are needed in          case. At this stage the system utilizes the DisjointClass
order to retrieve properties relevant for coreference resolu-     mappings. The system uses a simple algorithm to search
tion in both knowledge bases. Equivalence and subsumption         for contradictory mappings: it finds situations when two
relations allow relevant data structures in the source ontol-     classes in different ontologies are connected via a Sub-Super-
ogy to be found. Disjointness relations between concepts          Class mapping (created by ontology matching methods or
are usable for the Knowledge base updating stage, providing       inferred) and at the same time are disjoint (again, directly
evidence for inconsistency resolution. The architecture as-       or via inference). Such mappings are considered conflicting.
sumes that ontology matching methods provide their output         If the DisjointClass mapping has higher confidence, then the
in the standard Alignment API format [4].                         contradictory Sub-Super-Class mapping (or the mapping it
                                                                  was inferred from) is removed from consideration.
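The two steps described for instance transformation, combining several candidate class mappings into a single UNION query and discarding class correspondences that are contradicted by a more confident DisjointClass mapping, can be sketched as follows. This is a simplified illustration and not the KnoFuss implementation: the `Mapping` record, the function names, the class `tap:Company` and the confidence values are assumptions made for the example, and inferred mappings are not handled (only directly asserted ones are compared).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str        # class in the source ontology, e.g. "tap:Country"
    target: str        # class in the target ontology, e.g. "sweto:Company"
    relation: str      # "equivalent", "subclass" or "disjoint"
    confidence: float  # confidence reported by the matcher, in [0, 1]


def translate_class_query(target_class, mappings):
    """Rewrite a query over one target class as a UNION over mapped source classes."""
    sources = sorted({m.source for m in mappings
                      if m.target == target_class
                      and m.relation in ("equivalent", "subclass")})
    if not sources:
        return None  # no usable correspondence for this class
    patterns = ["{?uri rdf:type %s}" % s for s in sources]
    return "SELECT ?uri WHERE { " + " UNION ".join(patterns) + " }"


def drop_conflicting(mappings):
    """Drop class correspondences contradicted by a more confident disjointness."""
    disjoint = {}
    for m in mappings:
        if m.relation == "disjoint":
            key = frozenset((m.source, m.target))
            disjoint[key] = max(disjoint.get(key, 0.0), m.confidence)
    kept = []
    for m in mappings:
        key = frozenset((m.source, m.target))
        # A non-disjoint mapping is removed only if a disjointness mapping for
        # the same pair of classes has strictly higher confidence.
        if m.relation != "disjoint" and disjoint.get(key, -1.0) > m.confidence:
            continue
        kept.append(m)
    return kept


# Hypothetical matcher output; the confidence values are invented.
mappings = [
    Mapping("tap:Company", "sweto:Company", "equivalent", 0.9),
    Mapping("tap:Country", "sweto:Company", "equivalent", 0.6),
    Mapping("tap:Country", "sweto:Company", "disjoint", 0.9),
]

query_before = translate_class_query("sweto:Company", mappings)
query_after = translate_class_query("sweto:Company", drop_conflicting(mappings))
```

With these assumed inputs, `query_before` still retrieves individuals of both tap:Company and tap:Country, while `query_after` retrieves only tap:Company, because the disjointness mapping outweighs the weaker equivalence.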
3.2    Instance transformation
   The goal of the Instance transformation stage is to resolve    4.   EXPERIMENTS
structural differences between two knowledge bases so that           To test the KnoFuss architecture in a multi-ontology sce-
the architecture itself and instance-level methods can pro-       nario we used two artificially created knowledge bases in-
cess individuals in the source and target knowledge bases in      tended to be used as benchmarks for Semantic Web ap-
the same way. Alignments produced by ontology match-              plications: TAP [9] and SWETO testbed [1]. As primary
ing methods are applied to provide a uniform view over            methods for ontology matching we used two tools, which
data in two knowledge bases. In the KnoFuss architecture          participated in the OAEI 2008 contest: CIDER [8] and Lily
SPARQL queries are used as a primary means of retriev-            [18]. Also we used the SCARLET service [15] as a method
ing data (method applicability ranges, application contexts,      for generating DisjointClass mappings using existing ontolo-
sets of relevant attributes). These queries are translated into   gies defined elsewhere on the Web. Assuming that all sib-
the terms of the source ontology using available mappings.        ling classes in the target ontology (SWETO) were mutually
Sometimes a term in the target ontology potentially corre-        disjoint and using equivalence mappings produced by the
sponds to several terms in the source ontology. This happens      CIDER tool we inferred additional disjointness mappings.
when there are several candidate EquivalentClass mappings         Disjointness mappings were used to filter out conflicting
provided by one or several ontology matching tools. In such       equivalence relations with a low reliability. As coreference
situations we combine these mappings and consider them as         resolution methods for instances we used the same string
a single ClassUnion mapping. For instance when we con-            similarity techniques as in our single-ontology scenario ex-
sider the query                                                   periments [14]. While our experiments are still ongoing,
SELECT ?uri WHERE {                                               from these tests we could make several observations.
   ?uri rdf:type sweto:Computer_Science_Researcher }                 First, as could be expected, errors during the schema match-
the system tries to find all ClassCorrespondence mappings,        ing stage are propagated and can potentially lead to signifi-
which include the class sweto:Computer_Science_Researcher.        cant distortions during instance coreferencing. For instance,
In our example with the CIDER tool (see below) these in-          when matching instances of the class sweto:Company the
CIDER tool incorrectly aligned it with the class tap:Country.     on our experience, we can outline several directions for as-
This led the coreference precision to drop to 41% while it        sisting data fusion in the presence of schema heterogeneity.
reached 74% without this mistake (many companies have                First, label comparison is usually not considered suffi-
names derived from country names). We found ontological           ciently reliable evidence for coreference resolution (e.g., [7]).
constraints to be extremely valuable as a means to repair         However, more complex algorithms utilizing context data
such errors. Apart from the widely used owl:Functional-           (additional properties and links between individuals) can
Property and owl:InverseFunctionalProperty, which allow           only be applied to datasets containing sufficiently overlap-
non-ambiguous instance identification, ontological axioms,        ping data. It can be expected that many data integration
which may lead to inconsistency, allow filtering out incor-       tasks on the Web scale will only be able to rely on in-
rect mappings. These constraints include disjointness and         stance names and thus can only provide suggestions rather
datatype properties with cardinality constraints. E.g., know-     than generate owl:sameAs statements carrying strong im-
ing that Company is disjoint with Country (or inferring           plications. Given that the output is likely to be noisy it is
that) would repair the problem. However, most ontologies          necessary to keep track of data integration decisions (such
do not define these constraints explicitly because they are       as instance coreference mappings or statements considered
not needed in common ontology usage scenarios.                    incorrect) and their provenance. One possible way is to ex-
  Second, although semantic heterogeneity (different mean-        tend the coreference bundles approach [10] to include for
ing attached to similar resources) is seen primarily as a         each URI the confidence of its inclusion into the set.
schema-level knowledge modelling issue, it can cause prob-           Second, considering the limited capabilities of automatic
lems at the instance level as well. For instance, the TAP on-     ontology matching methods, availability of trusted reusable
tology contains a single individual describing the Coca-Cola      schema-level background knowledge is important. Such man-
Company while SWETO contains several individuals de-              ually built reference knowledge is useful when it covers the
scribing Coca-Cola branches in different countries. Whether       gaps existing in common ontology matching scenarios.
such instances should be considered coreferent depends on         Among others, such reference knowledge may include:
the context of the task.
  Then, as for the single-ontology scenario, it is hard to find        • Specifying rich semantic restrictions existing in a cer-
a single instance matching algorithm to apply to all kinds               tain domain, e.g., disjointness relations, property car-
of data: settings have to be optimized for a specific type               dinality and domain/range constraints.
of data rather than for a specific pair of ontologies as in
schema matching. Ontology mismatches may lead not just                 • Covering common ontological mismatches, which can-
to irrelevant instances being compared, but also to instances            not be resolved automatically. For instance, these can
being compared using inappropriate similarity measures.                  include transformation rules between common time
                                                                         modelling approaches and overlaps between subclasses
                                                                         of the same concept divided according to different cri-
5.    DISCUSSION                                                         teria (e.g., classifying historical artifacts from China
   As we said in the beginning, our primary interest when                by centuries or by dynastic periods). In this way a
implementing the version of the KnoFuss architecture to be               complex modelling style mismatch can be reduced to
used in a multi-ontology scenario was to observe the in-                 a terminological one, which can be treated automati-
fluence of schema-level mismatches on the data integration               cally.
stage.
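The suggestion to extend coreference bundles [10] so that each URI carries the confidence of its inclusion, together with provenance, could look like the sketch below. It is a minimal illustration under stated assumptions: the classes `Member` and `Bundle`, their fields, the example URIs and the threshold are hypothetical and are not the data model of [10] or of KnoFuss.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    uri: str
    confidence: float  # estimated reliability of including this URI
    source: str        # provenance: which method proposed the inclusion


@dataclass
class Bundle:
    """A set of URIs assumed to denote the same entity, with per-URI evidence."""
    members: dict = field(default_factory=dict)

    def add(self, uri, confidence, source):
        # Keep only the most confident evidence seen so far for each URI.
        current = self.members.get(uri)
        if current is None or confidence > current.confidence:
            self.members[uri] = Member(uri, confidence, source)

    def accepted(self, threshold):
        # URIs a consuming agent would treat as coreferent at this threshold.
        return sorted(u for u, m in self.members.items()
                      if m.confidence >= threshold)


# Hypothetical URIs and confidence values.
bundle = Bundle()
bundle.add("http://example.org/tap/Coca-Cola", 1.0, "seed")
bundle.add("http://example.org/sweto/Coca-Cola_UK", 0.55, "label-match")
bundle.add("http://example.org/sweto/Coca-Cola_UK", 0.40, "weaker-matcher")
```

Leaving the threshold to the consumer reflects the point that noisy output is better offered as suggestions than asserted as owl:sameAs statements: here `bundle.accepted(0.9)` keeps only the seed URI, whereas `bundle.accepted(0.5)` also admits the label-based match.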
   In comparison with the single-ontology data fusion sce-          Third, sometimes existing automatic matching tools im-
nario, adding the ontology heterogeneity challenge results        pose too rigid restrictions on their output aimed at improv-
both in decreased reliability of methods’ output and diffi-       ing the precision. For instance, some tools (like Lily) pro-
culties in precise estimation of this decrease. For data-level    duce only one-to-one equivalence mappings assuming that
coreference resolution methods we assume that the perfor-         two different classes in one ontology cannot be considered
mance of the method depends on some common features of            equivalent to the same class in another ontology. Thus, only
individuals belonging to a class: this assumption was the         the best candidate for equivalence is selected and all oth-
basis for the usage of application contexts in the KnoFuss        ers are filtered out. While a useful assumption for termi-
architecture. For ontology matching methods even knowing          nological mismatches, it may miss important mappings in
the estimated quality of a method (e.g., precision/recall in      the presence of conceptualisation and modelling style mis-
some test scenarios) it is hard to estimate whether it will       matches. From the data fusion point of view it would be
hold for a different pair of datasets. Second, it is hard to      useful if ontology matching algorithms could produce weak
measure precisely the impact of a single ontology-level error     mapping relations such as OverlapClass.
at the data level. This possible negative impact can result
in:                                                               6.    RELATED WORK
     • Erroneous widening or narrowing of the applicability          Given the amount of data, which needs to be handled on
       range of integration methods (misaligned concepts).        the Web scale, the need to use automatic coreference reso-
                                                                  lution techniques is recognized in the Semantic Web com-
     • Providing noisy evidence for data-level methods (mis-      munity [2], [7], [6]. Among the existing systems Sindice
       aligned properties and ontological restrictions).          [17] uses a straightforward method for coreference resolu-
                                                                  tion by utilizing explicitly defined key properties (inverse
Finally, some ontological mismatches, such as modelling style,    functional properties). Individuals, which have equal val-
cannot be resolved fully automatically by currently existing      ues for such properties are considered equivalent. This is
tools and can make data-level methods inapplicable. Based         an approach which provides high precision but can only
be applied to a limited subset of data, where such properties are defined
explicitly and have values in a standard format. Other tools implement
approximate matching techniques similar to those created in the database
integration and ontology matching domains. The OKKAM server [3] used the
Monge-Elkan string similarity metric for selecting coreferent instances
in the experiments. RDF-AI [12] concentrates on data-level issues when
combining datasets using the same schema. The algorithm uses string
(Monge-Elkan) and linguistic (WordNet) similarity to calculate distance
between literal property values and then uses the iterative graph
matching algorithm, similar to similarity flooding [13], to calculate
distance between individuals.

7.    SUMMARY AND FUTURE WORK
   We implemented the first prototype of the KnoFuss data integration
system for the multi-ontology environment and performed initial
experiments with it. In our view, combining automatic schema-level and
data-level alignment techniques in a single workflow still presents
difficulties not only because schema-level matching tools occasionally
produce errors, but also because some important types of ontology
mismatches are not handled properly by them. In particular, this concerns
conceptualisation and modelling style mismatches. While these mismatches
are very hard to resolve automatically, there are several ways to assist
the coreference resolution process when dealing with them, in particular:

     • Extend the functionality of automatic schema-matching tools to
       discover different types of mappings such as DisjointClass and
       OverlapClass.

     • Develop and publish reference ontologies explicitly defining
       common relations between concepts and properties, which remain
       neglected in existing ontologies, including disjointness relations
       and translation rules between common modelling paradigms.

     • Maintain provenance and estimated reliability of automatically
       produced instance-level mappings so that an agent can make a
       decision about whether to use them or not.

As top priorities for future work we are currently considering the
following:

     • Continue experimental testing with public linked data sources
       using detailed ontologies (such as DBPedia 3.2).

     • Develop a data fusion service, which can operate on the Semantic
       Web in conjunction with existing linked data sources and semantic
       applications (such as WATSON, SCARLET, Alignment Server).

8.    REFERENCES
 [1] B. Aleman-Meza, C. Halaschek, A. Sheth, I. B. Arpinar, and
     G. Sannapareddy. SWETO: Large-scale Semantic Web test-bed. In
     Workshop on Ontology in Action, 16th International Conference on
     Software Engineering and Knowledge Engineering (SEKE2004),
     pages 21–24, 2004.
 [2] P. Bouquet, H. Stoermer, and B. Bazzanella. An Entity Name System
     (ENS) for the Semantic Web. In 5th Annual European Semantic Web
     Conference (ESWC 2008), pages 258–272, 2008.
 [3] P. Bouquet, H. Stoermer, and D. Giacomuzzi. OKKAM: Enabling a web
     of entities. In WWW2007 Workshop i3: Identity, Identifiers and
     Identification, Banff, Canada, 2007.
 [4] J. Euzenat. An API for ontology alignment. In 3rd International
     Semantic Web Conference, volume 3298 of Lecture Notes in Computer
     Science, pages 698–712, Hiroshima, Japan, 2004. Springer.
 [5] J. Euzenat and P. Shvaiko. Ontology matching. Springer-Verlag,
     Heidelberg, 2007.
 [6] A. Ferrara, D. Lorusso, and S. Montanelli. Automatic identity
     recognition in the Semantic Web. In Workshop on Identity and
     Reference on the Semantic Web, ESWC 2008, Tenerife, Spain, 2008.
 [7] H. Glaser, I. Millard, A. Jaffri, T. Lewy, and B. Dowling. On
     coreference and the Semantic Web. In 7th International Semantic Web
     Conference (ISWC 2008) (submitted), Karlsruhe, Germany, 2008.
 [8] J. Gracia and E. Mena. Matching with CIDER: Evaluation report for
     the OAEI 2008. In 3rd Ontology Matching Workshop (OM’08) at the 7th
     International Semantic Web Conference (ISWC’08), Karlsruhe,
     Germany, 2008.
 [9] R. V. Guha and R. McCool. TAP: a Semantic Web platform. Computer
     Networks, 42(5):557–577, 2003.
[10] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to
     enable consistent reference on the Semantic Web. In Workshop on
     Identity and Reference on the Semantic Web (IRSW2008), Tenerife,
     Spain, 2008.
[11] M. Klein. Combining and relating ontologies: an analysis of
     problems and solutions. In Workshop on Ontologies and Information
     Sharing, 2001.
[12] Y. Liu, F. Scharffe, and C. Zhou. Towards practical RDF datasets
     fusion. In Workshop on Data Integration through Semantic Technology
     (DIST2008), ASWC 2008, Bangkok, Thailand, 2008.
[13] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A
     versatile graph matching algorithm. In 18th International
     Conference on Data Engineering (ICDE), pages 117–128, San Jose
     (CA US), 2002.
[14] A. Nikolov, V. Uren, E. Motta, and A. de Roeck. Integration of
     semantically annotated data by the KnoFuss architecture. In 16th
     International Conference on Knowledge Engineering and Knowledge
     Management (EKAW 2008), Acitrezza, Italy, 2008.
[15] M. Sabou, M. d’Aquin, and E. Motta. Exploring the Semantic Web as
     background knowledge for ontology matching. Journal of Data
     Semantics, 2008.
[16] F. Scharffe and D. Fensel. Correspondence patterns for ontology
     alignment. In 16th International Conference on Knowledge
     Engineering and Knowledge Management (EKAW 2008), pages 83–92,
     Acitrezza, Italy, 2008.
[17] G. Tummarello, R. Delbru, and E. Oren. Sindice.com: Weaving the
     open linked data. In 6th International Semantic Web Conference
     (ISWC/ASWC 2007), pages 552–565, 2007.
[18] P. Wang and B. Xu. Lily: Ontology alignment results for OAEI 2008.
     In 3rd Ontology Matching Workshop (OM’08) at the 7th International
     Semantic Web Conference (ISWC’08), Karlsruhe, Germany, 2008.