Queries, the Missing Link in Automatic Data Integration Aibo Tian, Juan F. Sequeda, and Daniel P. Miranker Department of Computer Science, The University of Texas at Austin Austin, Texas, USA atian@utexas.edu, {jsequeda, miranker}@cs.utexas.edu Abstract. This paper introduces the ontology mapping approach of a system that automatically integrates data sources into an ontology-based data integration system (OBDI). In addition to the target and source ontologies, the mapping al- gorithm requires a SPARQL query to determine the ontology mapping. Further, the mapping algorithm is dynamic: running each time a query is processed and producing only a partial mapping sufficient to reformulate the query. This approach enables the mapping algorithm to exploit query semantics to cor- rectly choose among ontology mappings that are indistinguishable when only the ontologies are considered. Also, the mapping associates paths with paths, instead of entities with entities. This approach simplifies query reformulation. The sys- tem achieves favorable results when compared to the algorithms developed for Clio, the best automated relational data integration system. We have developed an Ontology-based Data Integration (OBDI) system that departs from the conventional OBDI organization. The goal is to include automatic integration of new data sources, provided those data sources publish a self-describing ontology. A consequence of that goal is there is no longer the opportunity for an engineer to review and correct an ontology matching prior to its use by the query reformulation system. As ontology matching is understood to be an uncertain process, some other method of mapping refinement is needed. Our system uses queries for this purpose. Ontology mapping in conventional OBDI systems is determined prior to, and with- out knowledge of the queries to be executed [3]. A static representation of a mapping between target and source ontologies serves as input to a query reformulation module (Fig. 1(a)). In the system described here, ontology mapping is a dynamically computed component whose result depends on the query that is being processed (Fig. 1(b)). In effect, the query becomes a third argument to the ontology mapping algorithm. The query provides context for selecting among competing mappings. Since a mapping is specific to a query, the results may be limited to the partial mapping required by the query reformulation system. The organization was motivated by the following observations. A mapping method may determine that an entity in one ontology maps with equal likelihood to two or more entities in the other ontology. The mapping and reformulation of certain queries is correct only if one pairing is chosen. The correct choice may be different for differ- ent queries. The query itself may lend additional semantics that correctly resolve the ambiguity. These observations are supported by the example in Fig. 2. Looking at the ontolo- gies alone, there is insufficient information to determine if the class T :P eople should Query q Query q Ontology T Ontology Query Reformulated Ontology T Ontology Query Reformulated Ontology S Mapping Reformulation query Ontology S Mapping Reformulation query (a) Traditional (b) The proposed Fig. 1. OBDI systems with the traditional and the proposed ontology mapping component. Course Course P ref ix course : < T /Course > Course time P ref ix people : < T /P eople > time hasSchedule teacher student teachBy takeBy teacher date Select ?t date title People name Teacher Student Schedule W here { People name name name ?c course : time ?t . name place date ?c course : teacher ?p . string string date string ?p people : name “Einstein00 .} (a) Ontology T (b) Ontology S (c) SPARQL query (d) Query graph Note that since the predicate of a triple pattern is not allowed to be a variable in our definition, there exists only one query graph for each query q. The query Fig. 2. Example ontologies and SPARQL query. graph of the SPARQL query in Figure ?? is shown in Figure ??. 2.3 Problem Definition A ss-path correspondence records the mapping confidence between two ss-paths. be mapped to S:T eacher or to S:Student. A third possibility is a one-to-many map- Definition 9 (SS-PATH CORRESPONDENCE). Given two graphs G and ping entailing both. However, givenG ,the SPARQL a ss-path correspondence query inss-paths between two 0 Fig.p2(c), which and p (denoted by πasks) is for the 0 p,p0 0 0 a tuple < p, p , c >, such that p ∈ GRAPH-SS-PATH-SET , p ∈ GRAPH-SS- time of the course that is offered byPATH-SET “Einstein”, , and c isita is confidence that T :P eople should only be clearmeasure. G0 p p G mapped to S:teacher. A complementary We say pquery∈ π , and addressing p ∈ π . We also student use α enrollment to denote the confi-requires p,p0 0 p,p0 πp,p0 dence measure, which is α = c . In the above definition, we assume the πp,p0 p T :P eople to be mapped to S:student. Correct answers correspondence measures equivalence. from a reformulated query can be achieved only through mutuallyDefinition exclusive query dependent mappings. 10 (MATCH CANDIDATE). Given a query graph T , a graph q G is called a match candidate in terms of a set of correspondences Ω , where Tq ,G An overview of the matching algorithm Ω = {π :is p ∈as follows. To exploit GRAPH-SS-PATH-SET Tq ,G the context},implicit , p ∈ GRAPH-SS-PATH-SET p,p0 Tq 0 G if the following conditions are satisfied: in a query, the algorithm maps paths in the target ontology to paths in a source on- – G is a subgraph of S; tology. A path contains a datatype and – SINKmultiple ⊆ SINK ; classes connected by properties. Each G S – for all ss-path p ∈ GRAPH-SS-PATH-SET , there exists exact one ss-path element can be considered as the context of other correspondence π ∈ Ωelements , where p ∈ in the path. For; example, if GRAPH-SS-PATH-SET p,p0 Tq ,G 0 Tq G 0 all ss-path p ∈ GRAPH-SS-PATH-SET , there exists exact one ss-path a path contains a class P eople, and–afor property correspondence π teaches, ∈Ω then we can infer ; that P eople , where p ∈ GRAPH-SS-PATH-SET p,p0 Tq ,G G Tq all pair of ss-paths p , p ∈ GRAPH-SS-PATH-SET , if SOURCE is a T eacher instead of a Student.– for It follows that not all entities in one path have a = SOURCE , the two corresponded ss-paths p , p ∈ GRAPH-SS-PATH- p2 1 2 0 1 0 2 Tq p1 ∈Ω ∈Ω corresponding entity in the corresponding SET , π = SOURCE ; path. Path-based ,π mapping G must necessarily , also share the same source, SOURCE p1 ,p01 p02 Tq ,G p2 ,p02 Tq ,G p01 accommodate this. Since the number of unconstrained paths in a graph is much larger than the number of vertices, the properties suggest that path mapping is a combina- torially much harder problem. However, as the organization stipulates that ontology mapping is dynamic and specific to a query. Thus, the mapping may be limited to only those mappings required to reformulate the query. Relative to ontologies, queries are very small, and only the paths in the target ontology corresponding to the query need to be mapped. These constraints limit the search problem to a manageable size. Consider the specific SPARQL query in Fig. 2(c). Fig. 2(d) illustrates the part of ontology that corresponds to q, which is called a query graph. The task is to generate mappings that can be used to reformulate q in terms of ontology S. The algorithm must address the following challenges. The class P eople in T can be mapped to T eacher or Student in S, and some entities, such as class Schedule in S, do not have any mapped entity in T . However, the mapping for Schedule is necessary to reformulate the query. All of these challenges are met by mapping paths, represented as sequences of la- bels, where the sequence comprises alternating vertex and edge labels. In the example, query q has two paths in its query graph: {T :Course, T :teacher, T :P eople, T :name, string} and {T :Course, T :time, date} We search for a subgraph with two paths in ontology S, which have the highest probability such that each of the paths in S are mapped to paths that correspond to the query graph of q. The probability of each path mapping is determined by scoring the similarity of all labels in the paths. Mapping results should be: {T :Course,T :teacher, T :P eople, T :name, string} = {S:Course, S:teachBy, S:T eacher, S:name, string} {T :Course,T :time, date} = {S:Course, S:hasSchedule, S:Schedule, S:date, date} Note that this mapping is specific to the query. If another query asks for the time of the course that is taken by “Einstein”, the mapped path should contain S:Student instead of S:T eacher. Thus, defining similarity to a sequence of labels identified by the query introduces context. Limiting the problem to the paths in the query graph not only limits the size of the computation, but also removes any consideration of poten- tially conflicting interpretation. Given that sequence matching is an endemic problem in genomic data processing, there are many avenues open to exploration. Experimental Setup: The evaluation comprises real world ontologies from bibliogra- phy domain. DBLP ontology is generated by direct mapping the relational schema of the DBLP metadata database using Ultrawrap [2], and the UMBC ontology is from the OAEI benchmark track1 . These ontologies can be found on our website2 . We manually generated groundtruth mappings between paths. Subsequently, a computer program sys- tematically generates two kinds of SPARQL queries for each ontology. (1) A PathOnly query has query graph consisting of only one path in the groundtruth mappings. (2) A ClassAll query has query graph consisting of the set of all paths that share a same source in the groundtruth mappings. For each query, the generated path mappings are evaluated by comparing to the groundtruth mappings. The mapping results are evalu- ated by three metrics. valid rate measures whether the generated mappings contain all paths in a query, regardless of correctness. path precision measures the correctness of individual path mapping. query precision measures whether all path mappings are correct for a query. Baseline: Clio is a semi-automatic relational schema mapping system, however, the resulting algorithms are applicable to ontologies [1]. We implemented multiple config- urations of Clio as baselines. All baselines first generate mappings between datatype properties by picking the ones with highest similarities. Given a query, the baselines find the mapping candidates that contain all the mapped datatype properties. If there exists more than one candidates, Clio asks a user to make the decision, which is not al- lowed in our automatic setting. We implement three baselines to approximate this pro- cess: clio-minimal, clio-maximal, and clio-similar, which chooses the mapping candi- date with minimal summation of path lengths, maximal summation of path lengths, and 1 http://oaei.ontologymatching.org 2 http://www.cs.utexas.edu/~atian/page/dataset.html 11 1 0.9 0.9 0.9 0.8 0.8 0.8 0.7 0.7 0.7 proposed proposedapproach approach proposed approach 0.6 0.6 clio_minimal_top1 clio_minimal_top1 0.6 clio_minimal_top1 clio_maximal_top1 clio_maximal_top1 clio_maximal_top1 0.5 0.5 clio_similar_top1 clio_similar_top1 0.5 clio_similar_top1 clio_minimal_top2 clio_minimal_top2 clio_minimal_top2 0.4 0.4 clio_maximal_top2 clio_maximal_top2 0.4 clio_maximal_top2 0.3 0.3 clio_similar_top2 clio_similar_top2 0.3 clio_similar_top2 0.2 0.2 0.2 0.1 0.1 0.1 00 0 query_precision query_precision path_precision path_precision valid_rate valid_rate query_precision path_precision valid_rate (a) Bibliography, PathOnly (b) legend (c) Bibliography, ClassAll Fig. 3. The evaluation results for bibliography data sets. the highest similarity between the sources of the paths. We also enhance the baselines, by keeping 2 mappings instead of 1 for each datatype property. We generate mapping for each of them, and consider the mapping as correct if any of them is correct. Results: Overall, our approach dominates over all baselines as shown in Fig. 3. For PathOnly queries, query precision and path precision are the same, and all approaches have 100% valid rate, because each query only consists of one path in the query graph. In terms of query precision and path precision, our approach is 0.15 higher than the best top1 baselines. The three top2 baselines are improved with respect to the top1 approaches. However, even though the top2 approaches choose the correct map- ping from large number of candidate mappings, it would still not yield a mapping with higher precision than our approach. clio similar has better performance comparing to clio minimal and clio maximal. This indicates that the similarity between sources is important to the path mapping. For ClassAll queries, none of the baselines are compet- itive with respect to our approach. This is because the baselines determine the datatype property and source mappings first, and only use query to find valid path mappings. In some cases, the valid mappings are not existed. For our approach, the path mappings are jointly determined by both entity mappings and the query, so we have higher chances to find valid correct mappings. References 1. R. Fagin, L. Haas, M. Hernández, R. Miller, L. Popa, and Y. Velegrakis. Clio: Schema map- ping creation and data exchange. Conceptual Modeling: Foundations and Applications, pages 198–236, 2009. 2. J. F. Sequeda and D. P. Miranker. Ultrawrap: Sparql execution on relational data. Technical Report TR-12-10, University of Texas at Austin, Department of Computer Sciences, 2012. 3. H. Wache, T. Voegele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of information-a survey of existing approaches. In IJCAI-01 work- shop: ontologies and information sharing.