=Paper=
{{Paper
|id=None
|storemode=property
|title=Queries, the Missing Link in Automatic Data Integration
|pdfUrl=https://ceur-ws.org/Vol-914/paper_41.pdf
|volume=Vol-914
|dblpUrl=https://dblp.org/rec/conf/semweb/TianSM12
}}
==Queries, the Missing Link in Automatic Data Integration==
Queries, the Missing Link in Automatic Data Integration
Aibo Tian, Juan F. Sequeda, and Daniel P. Miranker
Department of Computer Science, The University of Texas at Austin
Austin, Texas, USA
atian@utexas.edu, {jsequeda, miranker}@cs.utexas.edu
Abstract. This paper introduces the ontology mapping approach of a system
that automatically integrates data sources into an ontology-based data integration
system (OBDI). In addition to the target and source ontologies, the mapping al-
gorithm requires a SPARQL query to determine the ontology mapping. Further,
the mapping algorithm is dynamic: running each time a query is processed and
producing only a partial mapping sufficient to reformulate the query.
This approach enables the mapping algorithm to exploit query semantics to cor-
rectly choose among ontology mappings that are indistinguishable when only the
ontologies are considered. Also, the mapping associates paths with paths, instead
of entities with entities. This approach simplifies query reformulation. The sys-
tem achieves favorable results when compared to the algorithms developed for
Clio, the best automated relational data integration system.
We have developed an Ontology-based Data Integration (OBDI) system that departs
from the conventional OBDI organization. The goal is to include automatic integration
of new data sources, provided those data sources publish a self-describing ontology. A
consequence of that goal is there is no longer the opportunity for an engineer to review
and correct an ontology matching prior to its use by the query reformulation system.
As ontology matching is understood to be an uncertain process, some other method of
mapping refinement is needed. Our system uses queries for this purpose.
Ontology mapping in conventional OBDI systems is determined prior to, and with-
out knowledge of the queries to be executed [3]. A static representation of a mapping
between target and source ontologies serves as input to a query reformulation module
(Fig. 1(a)). In the system described here, ontology mapping is a dynamically computed
component whose result depends on the query that is being processed (Fig. 1(b)). In
effect, the query becomes a third argument to the ontology mapping algorithm. The
query provides context for selecting among competing mappings. Since a mapping is
specific to a query, the results may be limited to the partial mapping required by the
query reformulation system.
The organization was motivated by the following observations. A mapping method
may determine that an entity in one ontology maps with equal likelihood to two or
more entities in the other ontology. The mapping and reformulation of certain queries
is correct only if one pairing is chosen. The correct choice may be different for differ-
ent queries. The query itself may lend additional semantics that correctly resolve the
ambiguity.
These observations are supported by the example in Fig. 2. Looking at the ontolo-
gies alone, there is insufficient information to determine if the class T :P eople should
Query q Query q
Ontology T Ontology Query Reformulated Ontology T Ontology Query Reformulated
Ontology S Mapping Reformulation query Ontology S Mapping Reformulation query
(a) Traditional (b) The proposed
Fig. 1. OBDI systems with the traditional and the proposed ontology mapping component.
Course Course P ref ix course : < T /Course > Course
time P ref ix people : < T /P eople > time
hasSchedule
teacher student teachBy takeBy teacher
date Select ?t date
title People name Teacher Student Schedule W here { People
name name name ?c course : time ?t . name
place date
?c course : teacher ?p .
string string date string
?p people : name “Einstein00 .}
(a) Ontology T (b) Ontology S (c) SPARQL query (d) Query graph
Note that since the predicate of a triple pattern is not allowed to be a variable
in our definition, there exists only one query graph for each query q. The query
Fig. 2. Example ontologies and SPARQL query.
graph of the SPARQL query in Figure ?? is shown in Figure ??.
2.3 Problem Definition
A ss-path correspondence records the mapping confidence between two ss-paths.
be mapped to S:T eacher or to S:Student. A third possibility is a one-to-many map-
Definition 9 (SS-PATH CORRESPONDENCE). Given two graphs G and
ping entailing both. However, givenG ,the SPARQL
a ss-path correspondence query inss-paths
between two 0 Fig.p2(c), which
and p (denoted by πasks) is for the 0
p,p0
0 0
a tuple < p, p , c >, such that p ∈ GRAPH-SS-PATH-SET , p ∈ GRAPH-SS-
time of the course that is offered byPATH-SET
“Einstein”, , and c isita is
confidence that T :P eople should only be
clearmeasure. G0
p
p
G
mapped to S:teacher. A complementary We say pquery∈ π , and addressing
p ∈ π . We also student
use α enrollment
to denote the confi-requires
p,p0
0
p,p0 πp,p0
dence measure, which is α = c . In the above definition, we assume the πp,p0 p
T :P eople to be mapped to S:student. Correct answers
correspondence measures equivalence. from a reformulated query can
be achieved only through mutuallyDefinition
exclusive query dependent mappings.
10 (MATCH CANDIDATE). Given a query graph T , a graph q
G is called a match candidate in terms of a set of correspondences Ω , where Tq ,G
An overview of the matching algorithm
Ω = {π :is
p ∈as follows. To exploit
GRAPH-SS-PATH-SET Tq ,G the context},implicit
, p ∈ GRAPH-SS-PATH-SET
p,p0 Tq
0
G
if the following conditions are satisfied:
in a query, the algorithm maps paths in the target ontology to paths in a source on-
– G is a subgraph of S;
tology. A path contains a datatype and – SINKmultiple
⊆ SINK ; classes connected by properties. Each
G S
– for all ss-path p ∈ GRAPH-SS-PATH-SET , there exists exact one ss-path
element can be considered as the context of other
correspondence π ∈ Ωelements
, where p ∈ in the path. For; example, if
GRAPH-SS-PATH-SET p,p0 Tq ,G
0
Tq
G
0
all ss-path p ∈ GRAPH-SS-PATH-SET , there exists exact one ss-path
a path contains a class P eople, and–afor property
correspondence π teaches,
∈Ω then we can infer ; that P eople
, where p ∈ GRAPH-SS-PATH-SET p,p0 Tq ,G
G
Tq
all pair of ss-paths p , p ∈ GRAPH-SS-PATH-SET , if SOURCE
is a T eacher instead of a Student.– for It follows that not all entities in one path have a
= SOURCE , the two corresponded ss-paths p , p ∈ GRAPH-SS-PATH- p2
1 2
0
1
0
2
Tq p1
∈Ω ∈Ω
corresponding entity in the corresponding SET , π
= SOURCE ;
path. Path-based
,π
mapping
G
must necessarily
, also share the same source, SOURCE
p1 ,p01
p02
Tq ,G p2 ,p02 Tq ,G p01
accommodate this. Since the number of unconstrained paths in a graph is much larger
than the number of vertices, the properties suggest that path mapping is a combina-
torially much harder problem. However, as the organization stipulates that ontology
mapping is dynamic and specific to a query. Thus, the mapping may be limited to only
those mappings required to reformulate the query. Relative to ontologies, queries are
very small, and only the paths in the target ontology corresponding to the query need to
be mapped. These constraints limit the search problem to a manageable size.
Consider the specific SPARQL query in Fig. 2(c). Fig. 2(d) illustrates the part of
ontology that corresponds to q, which is called a query graph. The task is to generate
mappings that can be used to reformulate q in terms of ontology S. The algorithm must
address the following challenges. The class P eople in T can be mapped to T eacher or
Student in S, and some entities, such as class Schedule in S, do not have any mapped
entity in T . However, the mapping for Schedule is necessary to reformulate the query.
All of these challenges are met by mapping paths, represented as sequences of la-
bels, where the sequence comprises alternating vertex and edge labels. In the example,
query q has two paths in its query graph:
{T :Course, T :teacher, T :P eople, T :name, string} and {T :Course, T :time, date}
We search for a subgraph with two paths in ontology S, which have the highest
probability such that each of the paths in S are mapped to paths that correspond to the
query graph of q. The probability of each path mapping is determined by scoring the
similarity of all labels in the paths.
Mapping results should be:
{T :Course,T :teacher, T :P eople, T :name, string}
= {S:Course, S:teachBy, S:T eacher, S:name, string}
{T :Course,T :time, date}
= {S:Course, S:hasSchedule, S:Schedule, S:date, date}
Note that this mapping is specific to the query. If another query asks for the time
of the course that is taken by “Einstein”, the mapped path should contain S:Student
instead of S:T eacher. Thus, defining similarity to a sequence of labels identified by
the query introduces context. Limiting the problem to the paths in the query graph not
only limits the size of the computation, but also removes any consideration of poten-
tially conflicting interpretation. Given that sequence matching is an endemic problem
in genomic data processing, there are many avenues open to exploration.
Experimental Setup: The evaluation comprises real world ontologies from bibliogra-
phy domain. DBLP ontology is generated by direct mapping the relational schema of
the DBLP metadata database using Ultrawrap [2], and the UMBC ontology is from the
OAEI benchmark track1 . These ontologies can be found on our website2 . We manually
generated groundtruth mappings between paths. Subsequently, a computer program sys-
tematically generates two kinds of SPARQL queries for each ontology. (1) A PathOnly
query has query graph consisting of only one path in the groundtruth mappings. (2)
A ClassAll query has query graph consisting of the set of all paths that share a same
source in the groundtruth mappings. For each query, the generated path mappings are
evaluated by comparing to the groundtruth mappings. The mapping results are evalu-
ated by three metrics. valid rate measures whether the generated mappings contain all
paths in a query, regardless of correctness. path precision measures the correctness
of individual path mapping. query precision measures whether all path mappings are
correct for a query.
Baseline: Clio is a semi-automatic relational schema mapping system, however, the
resulting algorithms are applicable to ontologies [1]. We implemented multiple config-
urations of Clio as baselines. All baselines first generate mappings between datatype
properties by picking the ones with highest similarities. Given a query, the baselines
find the mapping candidates that contain all the mapped datatype properties. If there
exists more than one candidates, Clio asks a user to make the decision, which is not al-
lowed in our automatic setting. We implement three baselines to approximate this pro-
cess: clio-minimal, clio-maximal, and clio-similar, which chooses the mapping candi-
date with minimal summation of path lengths, maximal summation of path lengths, and
1
http://oaei.ontologymatching.org
2
http://www.cs.utexas.edu/~atian/page/dataset.html
11 1
0.9
0.9 0.9
0.8
0.8 0.8
0.7
0.7 0.7
proposed
proposedapproach
approach proposed approach
0.6
0.6 clio_minimal_top1
clio_minimal_top1 0.6 clio_minimal_top1
clio_maximal_top1
clio_maximal_top1 clio_maximal_top1
0.5
0.5 clio_similar_top1
clio_similar_top1 0.5 clio_similar_top1
clio_minimal_top2
clio_minimal_top2 clio_minimal_top2
0.4
0.4 clio_maximal_top2
clio_maximal_top2 0.4 clio_maximal_top2
0.3
0.3 clio_similar_top2
clio_similar_top2 0.3 clio_similar_top2
0.2
0.2 0.2
0.1
0.1 0.1
00 0
query_precision
query_precision path_precision
path_precision valid_rate
valid_rate query_precision path_precision valid_rate
(a) Bibliography, PathOnly (b) legend (c) Bibliography, ClassAll
Fig. 3. The evaluation results for bibliography data sets.
the highest similarity between the sources of the paths. We also enhance the baselines,
by keeping 2 mappings instead of 1 for each datatype property. We generate mapping
for each of them, and consider the mapping as correct if any of them is correct.
Results: Overall, our approach dominates over all baselines as shown in Fig. 3. For
PathOnly queries, query precision and path precision are the same, and all approaches
have 100% valid rate, because each query only consists of one path in the query
graph. In terms of query precision and path precision, our approach is 0.15 higher
than the best top1 baselines. The three top2 baselines are improved with respect to the
top1 approaches. However, even though the top2 approaches choose the correct map-
ping from large number of candidate mappings, it would still not yield a mapping with
higher precision than our approach. clio similar has better performance comparing to
clio minimal and clio maximal. This indicates that the similarity between sources is
important to the path mapping. For ClassAll queries, none of the baselines are compet-
itive with respect to our approach. This is because the baselines determine the datatype
property and source mappings first, and only use query to find valid path mappings. In
some cases, the valid mappings are not existed. For our approach, the path mappings are
jointly determined by both entity mappings and the query, so we have higher chances
to find valid correct mappings.
References
1. R. Fagin, L. Haas, M. Hernández, R. Miller, L. Popa, and Y. Velegrakis. Clio: Schema map-
ping creation and data exchange. Conceptual Modeling: Foundations and Applications, pages
198–236, 2009.
2. J. F. Sequeda and D. P. Miranker. Ultrawrap: Sparql execution on relational data. Technical
Report TR-12-10, University of Texas at Austin, Department of Computer Sciences, 2012.
3. H. Wache, T. Voegele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner.
Ontology-based integration of information-a survey of existing approaches. In IJCAI-01 work-
shop: ontologies and information sharing.