<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning data-consistent mappings from a relational database to an ontology</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica e Sistemistica, Università di Roma 1 “La Sapienza”, Via Salaria 113</institution>
          ,
          <addr-line>00198 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Determining how to map information between different representational models is an essential task of semantic integration. Up to now, most research in this field has concentrated on the problem of finding pairs of elements which could have similar semantics (i.e. finding matches). From the work we have seen, it is not clear that any guarantees can be given that the obtained matches respect the intended semantics of the target model. We also argue that matches are not sufficient to clearly define mappings which respect the intended interpretation of the target model. In this paper, we define mapping consistency with respect to an intended interpretation and derive a notion of data-consistency, a necessary condition for consistency. From this, we propose a relational learning algorithm to construct data-consistent mappings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Building mappings between schema and/or ontologies is one of the central
issues of semantic integration [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Hand-crafting mappings is a complex, tedious
and error-prone task. Much research on how to (partially) automate semantic
integration has been led in both the database [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and ontology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
communities. Most of this work has concentrated on finding correspondences, also called
matches, between schema/ontology elements using some form of heuristics.
      </p>
      <p>
        There are two major drawbacks to existing approaches. Firstly, the
approaches we are aware of do not rely on any formal notion of intended
interpretation. Most implicitly rely on the quite arguable intuition that two
separately conceived models of similar “worlds” (e.g. a computer science department)
will contain similar concepts/roles with similar characteristics. They therefore
can give no guarantee on the correctness (with respect to the intended
interpretation) of the obtained matches. Secondly, having matches is far from
sufficient for correctly mapping data between representations.
Correct reasoning (and in particular querying) over integrated ontologies and/or
databases requires having mappings which respect the intended interpretation
of the integration system. In formal frameworks which have been proposed for
both database and ontology integration [
        <xref ref-type="bibr" rid="ref1 ref5">5, 1</xref>
        ], mappings usually take the form of
complex queries. The few systems which do allow building mappings [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] require
extra knowledge (e.g. foreign keys) and only produce mapping suggestions with
no guarantee of their correctness.
      </p>
      <p>In this paper, we propose a framework defining mapping consistency with
respect to an intended interpretation (i.e. the one given by an expert) and
derive a notion of data-consistency allowing one to check whether a mapping is
compatible with a previously given set of positive and negative examples. Given
that the examples are in accordance with the intended interpretation,
data-consistency is a necessary condition for consistency. From this we adapt
existing inductive logic programming techniques to build data-consistent
mappings. We consider here the problem of mapping a relational database into an
ontology in a global-as-view setting. There are mainly two aspects which make
mapping databases (rather than ontologies) a particular problem. First,
databases usually have little (if any) intensional knowledge which can be
exploited to build mappings. Second, the objects which are described by a
relational database do not have an explicit representation as they do in
ontologies. The mappings we consider consist of a set of Horn clauses where the
head of each clause is a concept or role of the target ontology and the body is
a conjunctive query over the relations of the source schema. In the work
presented here, we will consider the case where each concept or role of the
target schema can be described by a unique rule.</p>
      <p>The contributions of this paper are the following : (1) We introduce the notion
of mapping consistency with respect to a given intended interpretation. (2) We
show how a reduced form of consistency, data-consistency, can be tested in the
presence of examples. (3) We propose a relational learning algorithm for
building data-consistent mappings.</p>
      <p>This paper is organized as follows. Section 2 presents related work in
semantic integration and positions our work. Section 3 presents how we handle
instance-level mappings. Section 4 introduces our working example. Mapping
consistency and data-consistency are presented in Section 5. Section 6 proposes
a framework which allows learning data-consistent mappings. Finally, in
Section 7 we conclude and discuss future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and related work</title>
      <p>Building mappings from one model to another requires overcoming multiple
difficulties at both the conceptual and instance levels. At the conceptual
level, choices such as which information should be made explicit, how detailed
the information should be, or how it should be organized, may differ. For
example, one model might explicitly define a CSProfessor concept while in
another model this information might be left implicit (as a professor who
teaches a computer science course). At the instance level, different levels of
detail may exist between attribute values. For example, one source may split
into pieces an “address” field which another source may represent as a whole.
Another important aspect at the instance level is that sharing instances between
sources requires having a common naming. The major difficulty to overcome is
that any instance (e.g. a particular person) should be represented by a unique
object identifier in a target ontology while it is represented as a particular
instance of a table in a database.</p>
      <p>Integrating data from one source to another has two major requirements:
(1) knowing how instances are represented (identified) in each of the models
and how to map from one representation to the other, and (2) knowing how to
translate relationships which hold between instances. In both cases, mappings
may be used to represent the relationships. The problems of instance mapping
and schema mapping respectively seek to resolve these two requirements.</p>
      <p>
        Defining mappings between sources is a tedious and error-prone task.
Researchers have therefore looked to (partially) automate semantic integration. Up
to now most research, whether in the database [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or the ontology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
communities, has concentrated on methods which semi-automatically search for
correspondences between schema elements (e.g. attributes in the relational case). A
correspondence between a set of elements A of one schema and a set of elements
B of another schema is called a match (noted A ≅ B). When a set contains more
than one element, one must also define how to combine the values. A particular
match might put into correspondence the attribute name of a relation PersonS
of a source schema with the pair of attributes firstName and lastName of a
MemberT relation of a target schema, using the concatenation function to join
the first and last names.
      </p>
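      <p>As a concrete illustration of such a complex match, the following minimal sketch combines the target attributes into the single source attribute via concatenation. The dictionary encoding and the attribute names are assumptions for the example, not part of any particular system.</p>
      <p>

```python
# Hypothetical complex match: PersonS.name corresponds to the concatenation
# of MemberT.firstName and MemberT.lastName. Names are illustrative.

def combine_name(member_row):
    """Concatenate first and last names into the single source attribute."""
    return member_row["firstName"] + " " + member_row["lastName"]

member = {"firstName": "John", "lastName": "White"}
print(combine_name(member))  # John White
```

      </p>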
      <p>
        Different approaches to finding matches have been proposed. Many systems
(e.g. [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7–10</xref>
        ]) search for pairs of “similar” elements according to different
heuristics. Typical heuristics are of the form, “concepts with similar names are likely to
be similar”. The heuristics are usually formalized as similarity measures which
may take into account different types of information including for example the
names and textual descriptions (e.g. comments in the schema) of the elements,
and constraints on the values the elements can take (e.g. primary key, type,
uniqueness). [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] gives a list of the most commonly used information. Most approaches
also consider the similarity of related elements. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] bases its similarity measures
on a distance to elements known a priori to be the same.
      </p>
      <p>
        Other approaches such as the LSD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and GLUE [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] systems make use
of machine learning over instance data to build matches. LSD learns
classifiers based on previously hand-coded matches to predict how to match unseen
schemas and combines multiple classifiers some of which rely on having instance
data. GLUE learns to predict similarity between two elements of two sources by
estimating the joint probability that an instance belongs to both elements. Until
recently most systems were only able to suggest one-to-one matches between
elements. More recent systems such as iMAP allow to suggest complex matches
(e.g. price = rate ∗ (1 + taxes)) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The suggestions are obtained by multiple
predefined searchers such as text searchers looking for concatenations.
      </p>
      <p>While matches can be helpful to integrate data, we argue that they are
not sufficient to define mappings, since a set of matches can be interpreted in
several ways and some information may be missing. For example, knowing that
ProfessorT matches MemberS does not allow one to determine a mapping such as</p>
      <p>
        ProfessorT(x) ← MemberS(x), Profession(x, ”professor”)
In the literature there have only been a few attempts to (partially) automate
mapping construction. An example is the Clio system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which suggests mappings
given a set of matches. In particular, it searches for joins between relations
by following foreign keys. Its dependence on existing matches and on the
possible presence of constraints can be restrictive. This work does provide a
notion of correctness with respect to the constraints of the source and target
schemas, but not with respect to the intended interpretation. The approach
proposed in this paper suggests learning by example the constraints which must
be satisfied by the data to correctly reflect the intended meaning the mappings
should expose.
      </p>
      <p>
        In this paper, we consider using machine learning to build the mappings of
an integration system between a source database and a target ontology in a
global-as-view perspective. Following [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we build an integration system
I = ⟨G, S, M⟩ where G (resp. S) is the global ontology (resp. source schema)
expressed in a language LG (resp. LS) over an alphabet ΣG (resp. ΣS) and M
is the mapping between G and S. In our framework, M is a set of Horn clauses
pG(x⃗) ← conjS(x⃗, y⃗) where pG ∈ ΣG and conjS(x⃗, y⃗) is a conjunctive query
over ΣS. When learning, we will consider a particular interpretation domain
Δ and a given database instance DB of S over Δ, as well as a set of positive
and negative instantiation assertions (resp. KB+ and KB−) over ΣG and Δ. As
usual, the semantics of an ontology (similarly for a database) O defined over an
alphabet ΣO is given by an interpretation I = (Δ, ·I) where Δ is the domain of
interpretation and ·I is an interpretation function which associates with each
concept C (resp. each role R) in ΣO a subset of elements (resp. pairs) of Δ.
      </p>
      <p>The objects of the interpretation domain Δ will have been obtained after
an initial identification of unique object identifiers, as described in
Section 3. Our framework currently assumes that the source is a flat database
and therefore contains only positive assertions and no intensional knowledge.
Also, we currently require learning independent mappings for each concept and
role of the target ontology.</p>
    </sec>
    <sec id="sec-3">
      <title>Mapping instances to objects</title>
      <p>Fig. 1. An example people relation where FName and LName have been specified as the identification attributes, together with the resulting object identifiers:</p>
      <p>Oid | FName | LName | Telephone
people(”Mary”,”Green”) | ”Mary” | ”Green” | 0768
people(”Mary”,”Green”) | ”Mary” | ”Green” | 0679
people(”John”,”White”) | ”John” | ”White” | 0687</p>
      <p>It is common to hypothesize that both the target and source databases are
defined over a common fixed interpretation domain Δ. In the following we will
proceed similarly. However, this requires having a common object space,
i.e. knowing how to map the representation of objects in the concrete source
database (e.g. the instance of the table members whose attribute id has the
value 165) into a unique representation at the conceptual level of the target
ontology (e.g. the person ’john’). In this section we describe how such
identifiers can be obtained.</p>
      <p>For any given database, we can consider that each line of a table contained
in the database corresponds to an object. In the most favorable case, primary
and foreign keys have been defined in order to determine relationships between
tables. This information can be used to define object instances and
relationships between them. In some cases, however, no primary keys have been
given and connections between tables have not been made explicit, or the keys do
not give the correct view of the data; we will therefore not rely on having
such data.</p>
      <p>We will rather simply consider that the user provides, for each relation R,
a set of attributes ident(R) allowing the construction of an identifier for each
instance which corresponds to the target view he wishes to have of the data.
This identifier will be considered to refer to a unique object. Consider a
relation R of a given database DB and a set of identification attributes
K = ident(R) for R. Any instance of R is a tuple (x⃗) which can be uniquely
identified in R by y⃗ = πKR(x⃗), where πKR(x⃗) represents the projection of the
values in x⃗ onto the attributes K of R. If we have a set of identification
attributes for each relation and associate to each relation a unique functor fR,
any object described by the database can be assigned a unique identifier by
applying this functor to the values of the identification attributes. We can
complete the domain of interpretation of the original database Δv (v for
values) with a domain of interpretation for objects</p>
      <p>fR(πKR (−→x ))/(−→x ) ∈ R, K = ident(R)
Δo =</p>
      <p>[</p>
      <p>R∈ΣS</p>
      <p>Figure 1 gives an example relation where the attributes F N ame and LN ame
have been specified as the identification attributes. The relation also includes
a specific object identifier where the values correspond to the unique object
identifiers for the table. As the figure shows, an identifier is not necessarily
a primary key for the table. For example, in the figure both “Mary Green”
instances are considered to refer to the same person, who in this case
would have two telephone numbers.</p>
      <p>In the remainder of this work we will consider that identifiers have been
previously determined and added explicitly to the relations of the database, and
that the domain of interpretation shared between the target ontology and source
database will be the set Δ = Δo ∪ Δv. For readability, we will use short names
for each object. For example, we will use john as an object identifier instead
of a functor notation such as people(”John”, ”White”).</p>
    </sec>
    <sec id="sec-4">
      <title>Introductory example</title>
      <p>Let us consider the following example situation. We have a university with
different departments, one of which is the computer science department. The
university wishes to make use of the data which is already made available in existing
department databases. However, it wants to have access to this data given its
own predefined ontology. This requires building mappings between the
department schemas and its own ontology. Our working example will consider
building the mappings from the CS department schema to the university
ontology.</p>
      <p>Fig. 2. (a) An example partial database over the CS department schema: the
Peopled relation contains the tuples (mary, 1, ”Mary”), (john, 2, ”John”),
(charlotte, 3, ”Charlotte”) and (fred, 4, ”Fred”), and the Teamsd relation
includes the teams (dbteam, 1, ”Databases”), (tcteam, 2, ”Theoretical CS”),
(mlteam, 3, ”Machine Learning”) and (systeam, 4, ”Systems”). (b) The positive
and negative example instances for the target concepts. (c) The target
mappings, e.g.:</p>
      <p>CSProfessoru(X) ← Peopled(X, Y, _), Teamsd(_, _, _, Y), Teachesd(X, Y), Coursesd(_, Y, _, ”true”).</p>
      <p>The CS department schema contains four relations: Peopled, which describes
the members of the department; Teamsd, which describes the teams of the
department and who they are directed by; Teachesd, which relates members to the
courses they teach; and Coursesd, which contains the set of courses members of
the department participate in. In this department, all professors both direct a
team and teach a (computer science) course. This information is implicit and
has not been directly modeled. Figure 2(a) gives an example partial database
using the schema of the CS department. The relations have a d index to denote
that they belong to the department’s schema.</p>
      <p>We will consider two scenarios. In the first, the university ontology
contains (among others) a Professoru concept and, in the second, it contains a
more specific CSProfessoru concept. In the first (resp. second) scenario, the
intended interpretation of Professoru (resp. CSProfessoru) is any member of the
computer science department who teaches any course (resp. a computer science
course) and directs a team. Figure 2(b) gives the set of instances which appear
in the partial database of figure 2(a) and should (resp. should not) belong to
each of the two target concepts, as well as the correct target mappings for
each of them. In the approach we will describe afterwards, these mappings can
be learned by generalizing the descriptions of a set of example instances (not
necessarily all) we know to belong (or not) to the target concept or role.</p>
    </sec>
    <sec id="sec-5">
      <title>Intended interpretation</title>
      <p>In the context of this paper we only consider the alphabet Σ of symbols
associated with the elements of an ontology or schema. In the following, the
word schema may therefore often refer to its associated alphabet.</p>
      <p>During the conception of a schema or ontology, the conceiver has in mind a
specific semantics for the schema elements: their “intended” semantics. In some
way, this means that if the conceiver were presented with any “world” to be
modeled by his schema, he would be capable of giving the “intended” semantics
of his schema for this world (i.e. constructing the (extensional) database for
which this “world” is a model). Therefore the conceiver, or any person or
entity having a complete understanding of the schema, can be modeled as an
“expert oracle”.</p>
      <p>Definition 1 (Expert oracle). An expert oracle for a given alphabet Σ can be
modeled as a function OΣ which returns, for any domain of interpretation Δ, an
interpretation function ·I over Δ.</p>
      <p>In the previous definition, the oracle is defined as a function and thus each
alphabet yields a unique interpretation for each world. This simply suggests that
there is no ambiguity in the idea the conceiver has of the predicates appearing
in Σ. Once we have an oracle OΣ for our schema, the intended interpretation of
a given world is the one returned by the oracle.</p>
      <p>Definition 2 (Intended interpretation). Given a schema Σ, an expert oracle
OΣ, and a world Δ, the intended interpretation of Σ over Δ according to OΣ is
I = (Δ, ·I) with ·I = OΣ(Δ).</p>
      <p>As previously stated, we will consider working over a common domain of
interpretation Δ. Given a schema Σ, what remains to be determined by the
intended interpretation is the set of tuples belonging to each relation in Σ. In
the following we will denote this set by ΔI given the intended interpretation I.</p>
      <p>From a general point of view, a mapping from a source schema ΣS to a target
schema ΣT can be seen as a function m which transforms any database over ΣS
into a database over ΣT. To simplify the discussion, we will consider a database
to be a set of ground atoms, which is the case when we have no intensional
knowledge.</p>
      <p>A mapping m is consistent when the intended interpretation of ΣT (over Δ)
contains the database obtained by applying m to the intended interpretation of
ΣS (over Δ). m is complete when its application to the intended interpretation
of ΣS contains the intended interpretation of ΣT . Whenever it is both consistent
and complete, we say that m is a perfect mapping.</p>
      <p>Definition 3 (Mapping consistency and completeness). Let m be a
mapping from a source schema ΣS to a target schema ΣT , IS and IT respectively
be the intended interpretations for ΣS and ΣT over Δ.</p>
      <p>– m is consistent iff m(ΔIS) ⊆ ΔIT
– m is complete iff ΔIT ⊆ m(ΔIS)
– m is perfect iff it is both consistent and complete (i.e. ΔIT = m(ΔIS))</p>
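      <p>Read as set inclusions, Definition 3 can be checked directly when the intended interpretations are given as sets of ground atoms and m is a function between such sets. The toy mapping and data below are illustrative assumptions, not the paper's example.</p>
      <p>

```python
# Consistency, completeness and perfection of a mapping m (Definition 3).

def is_consistent(m, d_is, d_it):
    return m(d_is) <= d_it        # m(Delta^IS) contained in Delta^IT

def is_complete(m, d_is, d_it):
    return d_it <= m(d_is)        # Delta^IT contained in m(Delta^IS)

def is_perfect(m, d_is, d_it):
    return is_consistent(m, d_is, d_it) and is_complete(m, d_is, d_it)

d_is = {("Member", "john"), ("Member", "mary"),
        ("Profession", "john", "professor")}
d_it = {("Professor", "john")}

def m(db):
    # Map members whose profession is "professor" to the Professor concept.
    return {("Professor", t[1]) for t in db
            if t[0] == "Member" and ("Profession", t[1], "professor") in db}

print(is_perfect(m, d_is, d_it))  # True
```

      </p>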
      <p>When working with real databases, we only have access to a subset of Δ.
Furthermore in many cases, the effective database we work with is incomplete in
the sense that some relations holding among objects of Δ we know of, might not
appear in the database. In some cases it might be inconsistent with the intended
interpretation in that some relations which appear in the effective database do
not hold in the intended interpretation. In the following work on learning
mappings, we will consider that we may have incomplete data but that it is
consistent with the intended interpretation. If Σ denotes a schema, DB the
effective database over Σ, Δe ⊆ Δ the subset of objects we have a description
for in DB, and I the intended interpretation of Σ over Δ, then we have
DB ⊆ ΔeI ⊆ ΔI.</p>
      <p>When considering mapping a source schema S into a target ontology T we
will consider situations where we have an effective database DBS over an effective
subset of objects ΔS ⊆ Δ. Also we will consider two effective sets of instantiation
assertions KBT+ and KBT− over an effective subset of objects ΔT ⊆ Δ (usually
ΔT ⊆ ΔS will also hold). KBT+ (resp. KBT−) is a subset of assertions we know
that hold (resp. do not hold) among objects of ΔT in ΔI .</p>
      <p>In this context, and given a mapping m, we can define a notion of
consistency with respect to the effective data we have, which we will call
data-consistency. Of course, data-consistency is a necessary condition for
consistency.</p>
      <p>Definition 4 (Data-consistency). Given the effective databases DBS, KBT+
and (possibly) KBT−, a mapping m is data-consistent with respect to DBS,
KBT+ and KBT− iff KBT+ ⊆ m(DBS) and KBT− ∩ m(DBS) = ∅.</p>
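      <p>The data-consistency test of Definition 4 amounts to two set checks on the image of the source database under m. The candidate mapping and toy data below are illustrative assumptions.</p>
      <p>

```python
# Data-consistency (Definition 4): KBT+ entailed by m(DBS), KBT- disjoint.

def data_consistent(m, db_s, kb_pos, kb_neg):
    image = m(db_s)
    return kb_pos <= image and not (kb_neg & image)

db_s = {("Member", "john"), ("Member", "mary"),
        ("Profession", "john", "professor")}

def m(db):
    # Candidate mapping: every member is a professor (too general).
    return {("Professor", t[1]) for t in db if t[0] == "Member"}

kb_pos = {("Professor", "john")}
kb_neg = {("Professor", "mary")}
print(data_consistent(m, db_s, kb_pos, kb_neg))  # False: mary is entailed
```

      </p>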
    </sec>
    <sec id="sec-6">
      <title>Learning mapping rules</title>
      <p>We now propose an approach to learning data-consistent mappings. As an
inductive hypothesis, we will consider that having seen sufficient data allows
us to consider the mapping to hold for the remaining data. Therefore, to learn
a mapping between a source schema ΣS and a target schema ΣT, we will consider
having a database DBS over ΣS and two sets of respectively positive and
negative instantiation assertions KBT+ and KBT−.</p>
      <p>In our framework, we consider mappings which are defined as a set of Horn
clauses where the predicate of the head is in the target schema and the predicates
of the atoms of the body are in the source schema. A mapping only exists between
the source and target languages if the target concept can be described precisely
using the source predicates. When building mappings, we will consider this to
be true. When learning a target concept C (resp. role R) we will consider the
restriction of KBT+ (and KBT−) to the subset of atoms which belong to C (resp.
R). DBS will however not be restricted in any manner.</p>
      <sec id="sec-6-1">
        <title>6.1 Instance descriptions</title>
        <p>When building a mapping for a target concept C (and similarly for a role R),
we are trying to build a mapping rule which, with the given database DBS,
entails at least the positive examples and no negative examples (i.e. ensures
data-consistency). If we consider a particular concept Professoru, we are
looking for a description of the instances of the concept Professoru using the
language constructs of the schema ΣS. In some sense, we hypothesize that the
whole database describes that each particular instance belongs to the target
concept. For each instance of the target concept, we could build a clause
C(o1) ← DB(o1) where DB(o1) denotes that we have a special focus on the
instance o1. The most specific clause which is consistent with a set of
positive instances C+ = {o1, . . . , on} would therefore be the least general
generalization of the clauses C(oi) ← DB(oi). However, such an approach will
likely require many positive examples in order to generalize over common
content of DB and still contain sub-clauses which are specific to the database
itself and not to the examples.
</p>
        <p>Fig. 3. Extending the description of john (levels 1 and 2):</p>
        <p>desc1DB(john) = Peopled(john, 2, ”John”).
desc1DB(2) = Peopled(john, 2, ”John”), Teachesd(2, 1), Teachesd(2, 3), Teamsd(tcteam, 2, ”Theoretical CS”, 4), Teamsd(dbteam, 1, ”Databases”, 2).
desc2DB(john) = desc1DB(john), desc1DB(2), desc1DB(”John”).</p>
        <sec id="sec-6-1-2">
          <p>Given a particular instance of interest, some facts (atoms of the
database) have a greater importance depending on their “proximity” to the
target instance. Given an instance x, we will denote by desc1DB(x) the set of
atoms of DB in which x appears as one of the arguments. For example, the first
equation of figure 3 gives the description of john when the database DB is the
one of figure 2(a).</p>
          <p>In order to take into account more distant relationships which could be
required to define the target concepts, the description of an instance/value can be
extended with the descriptions of the instances/values of the atoms to which it is
connected. For example, john is connected to the values 2 and ”John” through
the Peopled relation. The description of john can be extended as shown in
figure 3. It should be noted that, in the presence of foreign keys, we may
restrict description extensions to only follow foreign keys.</p>
          <p>Definition 5 (Description of an instance). Given a database DB and an
instance x, the descriptions of level i are defined recursively as follows:</p>
          <p>desc1DB(x) = {A(. . . , x, . . .) ∈ DB}
desciDB(x) = ∪y∈instances(desci−1DB(x)) desci−1DB(y)</p>
          <p>The complete description of an instance x is the fixpoint obtained by
gradually incrementing the description level: desc∗DB(x) = descnDB(x) for the
smallest n such that descn+1DB(x) = descnDB(x). It should be noted that,
without any intensional knowledge, a database is always a finite set of atoms.
In this case, the complete description of any instance always exists and is
finite (at worst it is the whole database itself). Also, given a database DB
whose underlying relational structure has only one connected component, for any
instance x we have desc∗DB(x) = DB. Otherwise, desc∗DB(x) is the set of atoms
which form the connected component to which x belongs.</p>
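          <p>The level-i descriptions and their fixpoint can be sketched over a database encoded, as an assumption for illustration, as a set of ground atoms (predicate, args); the toy facts loosely mirror the running example.</p>
          <p>

```python
# Level-1 descriptions and their fixpoint (Definition 5): the fixpoint of
# desc^i is the connected component of the instance in the database.

def desc1(x, db):
    """Level-1 description: atoms of db in which x appears as an argument."""
    return {atom for atom in db if x in atom[1]}

def instances(atoms):
    """All instances/values appearing in a set of atoms."""
    return {a for (_, args) in atoms for a in args}

def desc_star(x, db):
    """Fixpoint of the level-i descriptions of x."""
    cur = desc1(x, db)
    while True:
        nxt = set()
        for y in instances(cur):
            nxt |= desc1(y, db)
        if nxt == cur:
            return cur
        cur = nxt

db = {("Peopled", ("john", 2, "John")),
      ("Teaches", (2, 1)), ("Teaches", (2, 3)),
      ("Teamsd", ("tcteam", 2, "TheoreticalCS", 4)),
      ("Teamsd", ("dbteam", 1, "Databases", 2)),
      ("Coursesd", (7, 8, "Biology", "false"))}  # disconnected from john

comp = desc_star("john", db)
print(len(comp))  # 5: every atom connected to john, but not the Coursesd one
```

          </p>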
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>Learning mappings from instance descriptions</title>
        <p>In this section we present our mapping learning approach based on the
generalization of instance descriptions. The general idea behind our approach is that
the description of an instance can be seen as a query which will at least return
the instance itself as a result. Generalizing the complete descriptions of the set of
instances we know to belong to the target concept will generate a query which
will return the initial instances and a subset of the target concept as long as
the mapping can be described as a conjunctive query over the source database.
However, generalizing the complete descriptions will likely be both infeasible
and generate overly specific mappings. The approach we propose is to initially
learn mappings using only a local context (low description depth) which is
gradually augmented if required. The intuition is that what is likely to
characterize a concept are the features most directly related to the instance.</p>
        <p>
          The generalization operator we will be using is the least general generalization
(lgg) defined by Plotkin [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] which is based on the notion of θ-subsumption.</p>
        <p>Definition 6 (θ-subsumption). A clause C θ-subsumes a clause C0 (noted
C ⪯ C0) iff there exists a substitution θ such that Cθ ⊆ C0.
        </p>
        <p>Plotkin’s least general generalization, given two clauses C and C0, produces
the most specific (with respect to θ-subsumption) clause lgg(C, C0) which
generalizes both C and C0.</p>
        <p>Definition 7 (Least general generalization). The least general
generalization lgg(C1, C2) of two clauses C1 and C2 is a clause C such that:
(1) C ⪯ C1, (2) C ⪯ C2, and (3) no other clause C0 such that C ⪯ C0 verifies
(1) and (2).</p>
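        <p>A compact sketch of Plotkin's lgg on clauses of the form head ← body can be written as follows. Atoms are (predicate, args) tuples and variables are '?'-prefixed strings — an assumed encoding; a shared table maps each pair of differing terms to one fresh variable, and the result is not reduced.</p>
        <p>

```python
# Plotkin's least general generalization (lgg) of two clauses.

def lgg_terms(t1, t2, table):
    """Anti-unify two terms: equal terms stay, differing pairs get a variable."""
    if t1 == t2:
        return t1
    if (t1, t2) not in table:
        table[(t1, t2)] = "?v%d" % len(table)
    return table[(t1, t2)]

def lgg_atoms(a1, a2, table):
    """lgg of two atoms with the same predicate and arity, else None."""
    (p1, args1), (p2, args2) = a1, a2
    if p1 != p2 or len(args1) != len(args2):
        return None
    return (p1, tuple(lgg_terms(x, y, table) for x, y in zip(args1, args2)))

def lgg_clauses(c1, c2):
    """lgg of two clauses: generalize heads and all compatible body pairs."""
    (h1, b1), (h2, b2) = c1, c2
    table = {}
    head = lgg_atoms(h1, h2, table)
    body = []
    for a1 in b1:
        for a2 in b2:
            g = lgg_atoms(a1, a2, table)
            if g is not None and g not in body:
                body.append(g)
    return head, body

c1 = (("Prof", ("john",)),
      [("Member", ("john",)), ("Profession", ("john", "professor"))])
c2 = (("Prof", ("mary",)),
      [("Member", ("mary",)), ("Profession", ("mary", "professor"))])
head, body = lgg_clauses(c1, c2)
print(head, body)
```

        </p>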
        <p>
          Plotkin has shown that the least general generalization of two clauses is
unique and can be calculated in polynomial time. However, the obtained clause
is not necessarily reduced. Reducing a clause requires testing θ-subsumption
between clauses, which is known to be NP-complete. Recent work has shown
that efficient implementations [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] remain tractable in many practical cases.
Furthermore, most mappings will likely not require high description depths and
therefore the clauses to generalize will likely remain small.
        </p>
        <p>Algorithm 1 learn(DB, E+, E−)
Input: a database DB, positive examples E+ and negative examples E−
Output: a mapping M which covers all of E+ and none of E−
d ← 0; M ← ∅
while M is not consistent with the examples do
  {expand the description depth and restart learning}
  d ← d + 1; M ← ∅
  while M is consistent with E− and there is an uncovered T(v⃗) ∈ E+ do
    C ← descriptiond(v⃗)
    M ← lgg(T(v⃗) ← C, M)
  end while
end while
return M</p>
<p>Algorithm 1 describes how a mapping is learned from a set of
positive and negative examples. The lgg operator is the one given in Definition 7.
The description function computes the description of an instance as described
previously.</p>
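Under the same toy clause representation, the learning loop might be sketched as follows. This is only a sketch of the idea, not the paper's implementation: `subsumes` and `lgg_clause` restate the θ-subsumption test and Plotkin's lgg compactly so the block is self-contained, and `db`/`describe` are hypothetical stand-ins for the database and the description_d function:

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def subsumes(c, d, theta=None):
    # theta-subsumption by backtracking: is there theta with c*theta ⊆ d?
    theta = theta or {}
    if not c:
        return True
    (p, args), rest = c[0], c[1:]
    for dp, dargs in d:
        if dp != p or len(dargs) != len(args):
            continue
        t, ok = dict(theta), True
        for a, b in zip(args, dargs):
            if is_var(a):
                if t.setdefault(a, b) != b:
                    ok = False
                    break
            elif a != b:
                ok = False
                break
        if ok and subsumes(rest, d, t):
            return True
    return False

def lgg_clause(c1, c2):
    # Plotkin's lgg: pair compatible literals, one variable per term pair
    pairs = {}
    def term(s, t):
        if s == t:
            return s
        return pairs.setdefault((s, t), "V%d" % len(pairs))
    out = []
    for p1, a1 in c1:
        for p2, a2 in c2:
            if p1 == p2 and len(a1) == len(a2):
                lit = (p1, tuple(term(s, t) for s, t in zip(a1, a2)))
                if lit not in out:
                    out.append(lit)
    return out

def learn(db, pos, neg, describe, max_depth=3):
    """Sketch of Algorithm 1: generalize over the positive examples,
    increasing the description depth d whenever a negative example
    becomes covered. Only the mapping body is built here; attaching
    the target-concept head is omitted for brevity."""
    for d in range(1, max_depth + 1):
        m = None
        consistent = True
        for v in pos:
            c = describe(db, v, d)
            m = c if m is None else lgg_clause(m, c)
            if any(subsumes(m, describe(db, w, d)) for w in neg):
                consistent = False      # deepen descriptions and restart
                break
        if consistent:
            return m
    return None

# hypothetical toy database: instances of "person" with a department
db = {"john": "it", "mary": "it", "bob": "hr"}

def describe(db, v, d):                 # hypothetical depth-d description
    return [("person", (v,)), ("dept", (v, db[v]))]
```

With positives john and mary and negative bob, `learn(db, ["john", "mary"], ["bob"], describe)` generalizes the two positive descriptions to person(V0) ∧ dept(V0, "it"), which does not cover bob, so learning stops at depth 1.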
<p>The mapping generation process can easily be made interactive. The user
starts by giving some positive examples, from which the system builds a mapping
and applies it to the source database. This results in the set of tuples which the
mapping would transform into the target concept. From these results, the user
can either remove instances which have been wrongly added (generating negative
examples) and/or add new instances as positive examples.
</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and future work</title>
<p>In this paper we have proposed an approach to learning mappings between a
database containing the actual data and an ontology. An important
distinction from related work is that our approach builds mappings which completely
define how to transform the representation of instances from one model to
another, whereas most existing efforts to automate semantic integration have
concentrated on building matches, which only relate schema elements to each other.
Furthermore, we have proposed a notion of data-consistency which provides a
minimal guarantee that the semantics of the obtained mappings are correct.</p>
<p>This work is still at an early stage. We have started implementing the
ideas developed in this paper and plan to complete the implementation in the
near future. We also intend to test the approach on real-world mapping
construction problems; if the tests are successful, we plan to extend the framework to
handle multiple sources. The current approach does not yet consider existing
matches, which could be made available by existing semantic integration
approaches. We have started studying how these could be used to generate initial
mappings which could then be corrected with respect to data-consistency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Data integration: A theoretical perspective</article-title>
. In Popa, L., ed.:
<source>PODS</source>
, ACM
          (
          <year>2002</year>
          )
          <fpage>233</fpage>
          -
          <lpage>246</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Semantic Integration Research in the Database Community : A Brief Survey</article-title>
          .
          <source>AI Magazine</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          :
          <article-title>A survey of approaches to automatic schema matching</article-title>
          .
          <source>VLDB J</source>
          .
          <volume>10</volume>
          (
          <year>2001</year>
          )
          <fpage>334</fpage>
          -
          <lpage>350</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          :
          <article-title>Semantic integration: A survey of ontology-based approaches</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>33</volume>
          (
          <year>2004</year>
          )
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giacomo</surname>
            ,
            <given-names>G.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ontology of integration and integration of ontologies</article-title>
. In Goble, C.A., McGuinness, D.L., Möller, R., Patel-Schneider, P.F., eds.:
          <source>Description Logics. Volume 49 of CEUR Workshop Proceedings</source>
          ., CEURWS.org (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Popa</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velegrakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
, Hernández, M.A., Fagin, R.:
          <article-title>Translating web data</article-title>
          .
          <source>In: VLDB</source>
          . (
          <year>2002</year>
          )
          <fpage>598</fpage>
          -
          <lpage>609</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The PROMPT suite: interactive tools for ontology merging and mapping</article-title>
          .
<source>International Journal of Human-Computer Studies</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valtchev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Similarity-based ontology alignment in owl-lite</article-title>
. In de Mántaras, R.L., Saitta, L., eds.: ECAI, IOS Press (
          <year>2004</year>
          )
          <fpage>333</fpage>
          -
          <lpage>337</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Palopoli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Terracina</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ursino</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Dike: a system supporting the semiautomatic construction of cooperative information systems from heterogeneous databases</article-title>
.
<source>Softw., Pract. Exper.</source>
          <volume>33</volume>
          (
          <year>2003</year>
          )
          <fpage>847</fpage>
          -
          <lpage>884</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Castano</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antonellis</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melchiori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Artemis: A process modeling and analysis tool environment</article-title>
          . In Ling, T.W.,
          <string-name>
            <surname>Ram</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , M.L., eds.
          <source>: ER</source>
          . Volume
          <volume>1507</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>1998</year>
          )
          <fpage>168</fpage>
          -
          <lpage>182</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ehrig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sure</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Ontology mapping - an integrated approach</article-title>
          . In Bussler,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Fensel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Studer</surname>
          </string-name>
          , R., eds.
          <source>: ESWS</source>
          . Volume
          <volume>3053</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2004</year>
          )
          <fpage>76</fpage>
          -
          <lpage>91</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Reconciling schemas of disparate data sources: A machine-learning approach</article-title>
. In: SIGMOD Conference. (
<year>2001</year>
)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Learning to map between ontologies on the semantic web</article-title>
          .
          <source>In: WWW</source>
          . (
          <year>2002</year>
          )
          <fpage>662</fpage>
          -
          <lpage>673</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Dhamankar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
          </string-name>
, P.: iMAP:
          <article-title>Discovering complex mappings between database schemas</article-title>
. In Weikum, G., König, A.C., Deßloch, S., eds.: SIGMOD Conference, ACM (
          <year>2004</year>
          )
          <fpage>383</fpage>
          -
          <lpage>394</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Plotkin</surname>
          </string-name>
          , G.D.:
          <article-title>A note on inductive generalization</article-title>
          .
          <source>Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>1970</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Maloberti</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Fast theta-subsumption with constraint satisfaction algorithms</article-title>
          .
          <source>Machine Learning</source>
          <volume>55</volume>
          (
          <year>2004</year>
          )
          <fpage>137</fpage>
          -
          <lpage>174</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>