RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation

RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation JyothsnaRachapalli The University of Texas at Dallas

800 West Campbell Road 75080-3021 Richardson TX USA

VaibhavKhadilkar The University of Texas at Dallas

800 West Campbell Road 75080-3021 Richardson TX USA

MuratKantarcioglu muratk@utdallas.edu The University of Texas at Dallas

800 West Campbell Road 75080-3021 Richardson TX USA

BhavaniThuraisingham The University of Texas at Dallas

800 West Campbell Road 75080-3021 Richardson TX USA

RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation 15F68829A30931C5BBF3D9C4C3D4E994 GROBID - A machine learning software for extracting information from scholarly documents Database Interoperability Query Translation RDF RDBMS Semantic Web SQL SPARQL

There have been several attempts to make RDBMS and RDF stores inter-operate. The most popular one, D2RQ, has explored one direction, i.e., to look at RDBMS through RDF lenses. In this paper we present RETRO, which investigates the reverse direction, i.e., to look at RDF through Relational lenses. RETRO generates a relational schema from an RDF store, enabling a user to query RDF data using SQL. A significant advantage of this direction in-addition to interoperability is that it makes numerous relational tools developed over the past several decades, available to the RDF stores. In order to provide interoperability between these two systems one needs to resolve the heterogeneity between their respective data models and include schema mapping, data mapping and query mapping in the transformation process [5]. However, like D2RQ, RETRO chooses not to physically transform the data and deals only with schema mapping and query mapping. RETRO's schema mapping derives a domain specific relational schema from RDF data and its query mapping transforms an SQL query over the schema into a provably equivalent SPARQL query, which in-turn is executed upon the RDF store. Since RETRO is a read-only framework, its query mapping uses only a relevant and relationally complete subset of SQL. A proof of correctness of this transformation is given based on compositional semantics of the two query languages.

Introduction

Why not relational databases for Semantic Web? Since its advent in the 1970's, relational databases have given rise to numerous applications and tools that cater to a wide variety of user requirements. The reasons for the dominance of relational databases for the past several decades are not trivial. It has continually offered the best mix of simplicity, expressivity, performance and robustness in managing generic data. RDBMS has its logical foundations in first-order predicate logic and set theory. It is basically made up of a schema and instance data. The data in a DB is operated upon by relational algebra which can be considered as an abstraction of SQL in a broad sense. The most prominent feature of RDBMS is not reasoning about schema [10], but query answering over the data. Query answering in a database is not logical reasoning, but finite model checking where the model is the given database instance and represents exactly one interpretation [2, Chapter 2]. Consequently, absence of information in a database instance is interpreted as negative information. In other words, traditional semantics in RDBMS is characterized as a closed-world semantics. However, over time, application requirements have evolved, giving rise to a need to support reasoning or entailment in query answering. A key example of this being World Wide Web or Semantic Web. Knowledge bases in Semantic Web are based on Description logics which generally are decidable fragments of firstorder logic. Unlike RDBMS, query answering in Semantic Web knowledgebases uses reasoning or logical entailment by considering all possible interpretations, which essentially is its open-world semantics. In addition to query answering, nature of data is an additional factor that leads to choice of RDF model over relational model. The data pertaining to RDBMS applications is homogeneous in nature as it is usually a complete description of a particular domain. This makes it amenable to being modeled as entities and relationships with only occasional problems of null values. On the other hand, data in the case of Semantic Web applications is most likely an incomplete instance dataset and such information cannot be modeled as entities and relationships without giving rise to countless null values. Factors such as incompleteness of data, non-finiteness of domain, query answering based on entailment and support for rich expressivity make RDF/OWL a better choice over relational model/RDBMS for Semantic Web applications.

Why RETRO, why bridge the gap in reverse direction?

Owing to the time Relational databases have been around, they have developed into a mature technology, dominating the persistence mechanism market with several well-established vendors and standards such as SQL and JDBC, which are well defined and accepted. For the same reasons, experience base of developers for relational databases significantly outnumber RDF specialists. There has been a considerable growth in RDF data and a lot of valuable data now resides in RDF stores. Since a vast majority of RDBMS developers may have little familiarity and limited resources to invest in Semantic Web and understand its underlying principles, tools and technologies, it becomes imperative to provide them with a means to conveniently query the existing RDF data. In-order to build such a tool, one has to step into their shoes and assess their requirements as they think relationally and know little about the expressivity supported by RDF/OWL but are indeed looking forward to use/query the RDF data with minimal effort. A tool like RETRO will enable one to effortlessly reuse existing data visualization, data reporting and data mining tools available for relational data. Additionally, commercial applications that use semantic web technologies are impeded because there are much fewer Business Intelligence tools available for them when compared with tools available for relational data. RETRO will not only help commercial applications use semantic web technologies but also serve to cordially welcome RDBMS users to the brave new world of semantic web. A curious RDBMS user may eventually be instigated to leverage the expressivity supported by SPARQL and get motivated to learn more about it and other semantic web technologies. It may thus eventually lead to higher acceptance of semantic web (SW) and bring more work force into SW research. Integrating various databases is often a motivation for adopting RDF [3] and this in turn also serves as a motivation for RETRO, which addresses aforementioned issues of RDBMS folks such as ease of adaptability, backwards compatibility and painless migration to newer DB/Web technologies (RDF/Semantic Web) by bridging the gap between the two fundamentally different data management systems.

How is the gap bridged to achieve interoperability?

RETRO facilitates interoperability between RDF stores and RDBMS without physically transforming the data. This interoperability is achieved in two steps, namely schema mapping and query mapping. In schema mapping, RETRO aims to provide RDBMS users access to RDF data by creating a relational schema that directly corresponds to the RDF data. The relational schema is obtained by virtually, vertically partitioning [1] RDF data based on predicates. The rationale behind this approach is described in the section on schema mapping. The resulting relational schema consists of relations of degree two, i.e., the subject and object of a predicate form the attributes of these relations. The user can now create an SQL query over this relational schema. This SQL query is then converted into a semantically equivalent SPARQL query using RETRO's Query mapping procedure. The SPARQL query is now executed over the RDF store and the results are presented to the user in relational format. Along the following lines we state some assumptions made for query mapping/translation. A query translator is not a query processor and is therefore not responsible for verifying semantic correctness of the source query and assumes it to be correct. Given a semantically correct source query, the translator transforms it from source language to target language by preserving the semantics. Figure 1 shows the schema and query mapping components of RETRO.

Problem Scope: Formal semantics of SPARQL was first discussed in an influential work by Perez et. al [12] in 2006 and was followed and adapted by the W3C standard specification of SPARQL [13] in 2008. In [12], the authors present algebraic syntax with compositional semantics of SPARQL. We reuse both of these features of SPARQL as it is the target language of our query translation. In other words, we start with an algebraic syntax and compositional semantics of relational algebra and map it to the algebraic syntax and compositional seman- tics of SPARQL respectively. Since we reuse the compositional semantics from [12], the assumptions made in it also hold in this paper, which we briefly restate here. Algebraic formalization of the core fragment of SPARQL is done over simple RDF graphs without RDFS vocabulary and literal rules, and set semantics is used instead of bag semantics as implied in the SPARQL specification.

Problem Definition: Given an RDF dataset and a set of relational algebra operators, the goal is firstly to derive a relational schema that accurately represents the RDF data and secondly, to map each of the relational algebra operators (applied to operands from relational schema) to a semantically equivalent SPARQL algebra operator or a set of operators. The result set obtained upon SPARQL query evaluation on the RDF store is presented to the user as a relation.

Contributions: We describe a means of computing domain specific relational schema from a given RDF data-store. We define a relationally complete [6] core fragment of SQL/relational algebra keeping in mind the read-only aspect of the application. We then describe compositional semantics of this core fragment of relational algebra. Subsequently, we describe compositional semantics of SPARQL based on work done in [12] and [5]. Finally, we give a proof of semantics preservation of the translation based on compositional semantics of the two query languages. The aforementioned abstractions, restrictions and assumptions made for query translation lend to the formulation of a simple proof of correctness, by retaining the core complexity of the problem. To the best of our knowledge, this paper presents the first formal semantics preserving translation from SQL to SPARQL.

RETRO Framework: Schema Mapping

ER model and Description logics both model a given domain using classes and relationships. Calvanese el al., in [4] define a translation function that transforms ER model into a ALUN I, DL knowledge base [4] and they prove that this transformation is information preserving. Relational databases have ER model as the most popular conceptual data model and can be mapped to a description language that closely corresponds to the constructs of ER model. However, in Semantic Web there are many flavors of description languages with varying expressivity and computational properties that can be used to model a given domain. It is not possible to map or transform every description language to ER model as they may have different expressivity, and this is the reason why usual ER to RDF mappings cannot be leveraged to do a reverse transformation in RETRO. In our framework, we derive a domain specific relational schema from the ABox part of the DL knowledge base and do not explicitly consider the TBox. The schema mapping algorithm traverses the entire set of ABox triples and extracts every unique predicate that is stored separately for later use. The transformation or derivation of relational schema from RDF store using the schema mapping algorithm is trivially an information preserving translation. The map P returned by Algorithm 1 can be used to display the relation schema to users. A relation schema is constructed by extracting each of the predicate names from P and adding attributes s and o. A relation instance, p I , in a schema S, for a predicate p, can be constructed by the evaluation of a triple pattern, tp, such that tp.s =?s i ∧ tp.p ∈ P ∧ tp.o =?o i , over a triple store. A pair of values (s j , o j ) that are bound to variables (?s i , ?o i ) form a tuple in p I . Algorithm 1 also ranks the predicates in the order in which they are discovered. The significance of ranking will become evident in the next section. We now give an example of a RDF dataset D, which is an extension of the dataset from [12], and its corresponding relational schema S generated by Algorithm 1. We also give an example of an instance of relation "name" that can be constructed from D. When the number of predicates in a ABox is very large, we envisage providing a user with an auxiliary schema for the sole purpose of understanding the structure of data to help them with query formulation. However, the implementation of such a schema is left as a part of future work. We would now like to delve into how RETRO as a tool caters to knowledge evolution and ontology dynamics. Knowledge representation is an important aspect of knowledge. As mentioned earlier, over the past several decades application requirements have evolved and commensurately the ways to represent knowledge have also evolved. However, the gap between the users of the two knowledge representation methods, namely relational and RDF represenatation, still remains. Often, an analogy is established between databases and DL knowledgebases by comparing the schema of a database to a TBox and a database instance to an ABox. This analogy acts as a stepping stone towards the process of bridging the gap between them. RETRO supports evolution of knowledge representation by bridging the gap between the two knowledge representation paradigms through schema mapping and query mapping. RETRO will additionally facilitate application of existing database technologies for e.g., data analysis tools used in stream databases [11], to RDF data. This will enable one to capture and analyze the dynamics of RDF data sets using the aforementioned tools, which are much more advanced than the current Semantic Web state of the art.

Algorithm 1 Schema-Mapping()Input: ABox Output: A map, P (relation name → rank) 1: P ← ∅

RETRO Framework: Query Mapping

In this section we first illustrate the SQL-to-SPARQL translation with some examples and then describe an abstract grammar for SQL which is used to build parse trees for a given SQL query. We then describe an algorithm that translates a given SQL query into a semantically equivalent SPARQL query. In our examples we use the RDF dataset D and its associated relational schema S that were defined in the previous section.

In Table 3, we give a description of an SQL query on schema S, its equivalent SPARQL query over dataset D obtained from Algorithm 2 and subsequently, we show the result of evaluating the SPARQL query on D. The SQL queries use the following relational algebra operators: Join, Union and Difference. Additional example queries for other operators can be found in our technical report [14].

Abstract Grammar for SQL

As shown in the abstract grammar below, an input SQL query for the translation can be a simple Cross product query, Select-Project-Join query or queries with Union, Intersection or Difference (Except) operators. Additionally, the Query-Mapping algorithm (Algorithm 2) supports equi-join and does not consider conditional join because of the nature of RDF data. Conditional joins are meaningful in relational data, however, since the relational schema is derived from RDF data and joins over the derived schema are usually performed over attributes which are URI's, conditional joins do not come into picture [14].

SqlQuery

Query Mapping algorithms

The conceptual evaluation strategy of an SQL query first computes the crossproduct of the tables in the FROM list, then deletes the rows in the cross-product that fail the qualification conditions and finally deletes all the columns that do not appear in the SELECT list. Additionally, if DISTINCT is specified, duplicate rows are eliminated. Algorithm 2 is used to construct a SPARQL query and is based on the conceptual query evaluation strategy of SQL. It takes as input an SQL query string and a map P generated from Algorithm 1, and returns an equivalent SPARQL query string. The SQL query string is first used to generate a parse tree using function parse(SQLQuery). We then extract SQL SELECT, FROM, WHERE-JC (Join Conditions of the form: Attr Op Attr) and WHERE-BE (Boolean Expressions of the form: Attr Op V alue) clause strings from the parse tree. The procedure then performs a query translation based on the type of a query. Algorithm 2 makes use of several sub-procedures to construct a SPARQL query. We briefly explain each of these sub-procedures below. An interested reader can find detailed explanations of these sub-procedures in our technical report [14].

Algorithm 2 Query-Mapping()

Input: An SQL query, qin, and a map P Output: A SPARQL query, qout 1: qout ="" {A SPARQL query that is initially blank} 2: tree = parse(qin) { A parse tree obtained by parsing qin} 3: q SELECT in = tree.getSelectClause(); q F ROM in = tree.getF romClause(); 4: q W HERE−JC in = tree.getJoinConditions() q W HERE−BE in = tree.getBooleanExpr() 5: q SELECT out ="SELECT " q W HERE out ="W HERE { " 6: T P ← ∅ { The map of triple patterns is initially empty} 7: if tree.type = CrossProductQuery then 8:

T P = TransSQLFromClause(q F ROM in ) 9:

q W HERE out + = TransSQLWhereClause(q W HERE−JC in , q W HERE−BE in , T P, P ) 10:

q SELECT out + = TransSQLSelectClause(q SELECT in , T P ) 11: qout = q SELECT out + q W HERE out

+" }" 12: else if tree.type = SPJQuery then 13:

T P = TransSQLFromClause(q F ROM in )) 14:

q W HERE out + = TransSQLWhereClause(q W HERE−JC in , q W HERE−BE in , T P, P ) 15:

T P = extractT P (q W HERE out ) 16:

q SELECT out + = TransSQLSelectClause(q SELECT in , T P ) 17: qout = q SELECT out + q W HERE out +" }" 18: else if tree.type = UnionQuery then 19: q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree() 20:

q out 1 = Query-Mapping(q1), q out 2 = Query-Mapping(q2) 21: qout = Merge(q out 1 , q out 2 , "UNION") 22: else if tree.type = IntersectQuery then 23: q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree() 24:

q out 1 = Query-Mapping(q1), q out 2 = Query-Mapping(q2) 25: qout = Merge(q out 1 , q out 2 , "INTERSECT") 26: else if tree.type = ExceptQuery then 27: q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree() 28:

q out 1 = Query-Mapping(q1), q out 2 = Query-Mapping(q2) 29: qout = Merge(q out 1 , q out 2 , "EXCEPT") 30: end if 31: return qout TransSQLFromClause: This sub-procedure takes as input a set of table names from a SQL FROM clause, R, and returns a map from table names to triple patterns, T P . For each table name that is present in a FROM clause, the sub-procedure generates a triple pattern of the form, ?s i r i ?o i , where r i denotes a given table name. A generated triple pattern is then added to map T P indexed by its corresponding table name.

TransSQLWhereClause: This sub-procedure takes sets of join conditions JC and boolean expressions BE, a map T P from TransSQLFromClause and map P generated from Algorithm 1 as inputs and returns a SPARQL WHERE clause string. In this sub-procedure, we use the aforementioned ranking of relations to sort the set JC to ensure that a consistent join ordering is obtained. We present the following example to show the necessity of ranking relations: Consider relations name, age and dept with ranks 1, 2 and 3 and a query containing only join conditions "name.s = age.s AND dept.s = age.s". If we do not rank relations, we would get the SPARQL WHERE clause: {?s 1 name ?o 1 . ?s 3 age ?o 2 . ?s 3 dept ?o 3 }. However, using a sorting procedure, the second join condition would become age.s = dept.s, since the rank of dept (3) was originally > the rank of age (2). The sorted join conditions now lead to the correct WHERE clause: {?s 1 name ?o 1 . ?s 1 age ?o 2 . ?s 1 dept ?o 3 }. Thus, the ranking of relations used in join re-ordering ensures generation of a consistent SPARQL WHERE clause. Note that the process of ranking relations and sorting join conditions is merely an implementation detail and has little conceptual significance. The sub-procedure then uses the sorted join conditions to generate a part of the SPARQL WHERE clause string. This is followed by generating filter conditions using triple patterns in T P and boolean expressions present in a SQL query.

TransSQLSelectClause: This sub-procedure takes as input a set of SQL SELECT attributes, A, and map T P , and returns a SPARQL SELECT clause string. It generates the variables for SPARQL SELECT by iterating over the set A, where an attribute a i ∈ A is of the form "relationName.attributeName". During an iteration, the relation name part of an attribute a i is used to access the corresponding triple pattern from T P and the attribute name part of a i is used to access the desired variable within this triple pattern.

Merge: This sub-procedure is used to merge two generated SPARQL subquery strings for SQL queries containing Union, Intersection or Difference operators. The sub-procedure first extracts the necessary components of both SPARQL sub-queries (SELECT clause, triple patterns and filter expressions in WHERE clause). Next, the two sub-queries are combined into a merged query using various operations depending on the operator in the original SQL query. These operations include concatenation of triple patterns of both sub-queries, variable renaming to ensure consistency and introduction of specific keywords.

Semantics Preserving SQL-To-SPARQL Translation

An important principle of compositional semantics is that the semantics should be compositional i.e. the meaning of a program expression should be built out of the meanings of its sub-expressions. We present compositional semantics in terms of abstract syntax definition of the language, semantic algebras and the valuation functions.

-Abstract Syntax: It is commonly specified as BNF grammar and it maps expressions in the language to parse trees. A Syntax domain is the term used to denote a collection of values with common syntactic structure. -Semantic Algebra: A Semantic algebra consists of a semantic domain accompanied by a corresponding set of operations. A semantic domain is a set of elements grouped together because they share some common property, for e.g., set of natural numbers, set of tuples in a relation etc. The operations are functions that need arguments from the domain to produce answers, for e.g., selection, projection, join, are meaningful operations over the domain of relations. An Operation is defined in terms of its functionality and a description of its mapping.

• Functionality: For a function f , its functionality is an expression from domain and co-domain denoted by f :

D 1 × D 2 × ... × D n → A,

Compositional Semantics of SPARQL

In this section, we first describe the related formal notation and then give abstract syntax, semantic algebra of mapping sets and valuation functions as the compositional semantics for SPARQL.

-RDF terms: Denoted by T , it comprises of pairwise disjoint infinite sets of I, B and L (IRI's, Blank nodes and Literals respectively).

-Triple: A triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L),

where s is a subject, p is a predicate, and o is an object. -Triple Pattern: A triple pattern, tp, is a triple of the form (sp, pp, op)

∈ (I ∪ V ∪ L) × (I ∪ V ) × (I ∪ V ∪ L)

, where V is an infinite set of variables that is pairwise disjoint from the sets I, B, and L; and sp, pp, and op are a subject pattern, predicate pattern, and object pattern respectively. -A mapping µ is a partial function µ : V → T . Given a triple pattern t, µ(t) denotes the triple obtained by replacing the variables in t according to µ. Domain of µ, dom(µ) is the subset of V where µ is defined and Ω is a set of mappings µ. Two mappings, µ 1 and µ 2 are compatible when for all x ∈ dom(µ 1 ) ∩ dom(µ 2 ), it is the case that µ 1 (x) = µ 2 (x).

Abstract Syntax for SPARQL

The abstract syntax shown below describes the structure of graph pattern expressions.

gp Operations defined over the semantic domain Ω are:

1. The σ expr operator takes a mapping set as an input and retains only the mappings that satisfy the expression expr and returns the resulting mapping set. Its functionality is denoted by σ expr : Ω → Ω and description by:

-σ expr Ω = {µ|µ ∈ Ω ∧ µ expr} 2.

The ∏ vars operator takes an input mapping set and retains only the variables belonging to the set vars in each mapping and returns the resulting mapping set. Its functionality is given by ∏ v1,v2,..,vn : Ω → Ω and description by: -∏ v1,v2,..,vn Ω = {µ |v1,v2,..,vn |µ ∈ Ω} 3. The Union operator ∪, takes as input two mapping sets and computes the set union and returns the resulting mapping set. Its functionality is given by ∪ : Ω × Ω → Ω and description by:

Ω 1 ∪ Ω 2 = {µ|µ ∈ Ω 1 ∨ µ ∈ Ω 2 } 4.

The Join operator's functionality is given by : Ω × Ω → Ω, description by:

Ω 1 Ω 2 = {µ 1 ∪ µ 2 |µ 1 ∈ Ω 1 , µ 2 ∈ Ω 2 are compatible mappings} 5. Difference operator's functionality is : Ω × Ω → Ω, description is:

Ω 1 Ω 2 = {µ ∈ Ω 1 | for all µ ∈ Ω 2 , µ and µ are not compatible } 6. The Optional operator's functionality is Ω × Ω → Ω, description is: -Ω 1 2 = {(Ω 1 Ω 2 ) ∪ (Ω 1 Ω 2 )}

Valuation functions

[]: gp → Ω ; The valuation function [], maps an element of gp to Ω, i.e, it maps the language's abstract syntax structures to meanings drawn from semantic domains. The domain of a valuation function is a set of derivation trees of the language and is defined structurally. It determines meanings of its subtrees and combines them into meaning of the entire tree.

-[tp ]= {µ|dom(µ) = var(tp) ∧ µ(tp) ∈ G} -[gp F ILT ER expr]= σ expr [gp ] -[P ROJECT v1,v2,..,vn gp]= ∏ v1,v2,..,vn [gp ] -[gp 1 U N ION gp 2 ]= [gp 1 ]∪ [gp 2 ] -[gp 1 AN D gp 2 ]= [gp 1 ] [gp 2 ] -[gp 1 OP T gp 2 ]= [gp 1 ] [gp 2 ]

The meaning of expression [gp 1 U N ION gp 2 ]is computed by first evaluating [gp 1 ]to get Ω 1 and [gp 2 ]to get Ω 2 and then computing Ω 1 ∪Ω 2 using the mapping set's semantic algebra operation union, defined earlier.

Given a mapping µ and expression expr, µ satisfies expr (µ expr), iff (i) expr is bound(?X) and ?X ∈ dom(µ);

(ii) expr is ?X op c, ?X ∈ dom(µ) and µ(?X) op c;

(iii) expr is ?X op ?Y , ?X, ?Y ∈ dom(µ), and µ(?X) op µ(?Y ), where op ∈ {< | ≤ | ≥ |>| =}; (iv) expr is (¬expr 1 ) and it is not the case that µ expr 1 ; (v) expr is (expr 1 ∧ expr 2 ) and µ expr 1 and µ expr 2 ; (vi) expr is (expr 1 ∨ expr 2 ) and µ expr 1 or µ expr 2 ;

Compositional Semantics of Relational Algebra

The close relationship between SQL and Relational Algebra is the basis for query optimization in a RDBMS [15], and in our work it forms the basis for query translation. In order to simplify the proof, we use relational algebra based syntax and define its compositional semantics by using tuple relational calculus formulas since they closely mirror the mapping based semantics of SPARQL. The proof of equivalence between relational algebra and relational calculus presented in [7] enables us to define the compositional semantics in this manner.

Abstract Syntax for Relational Algebra The abstract syntax describes the structure of relational algebra (RA) expressions. E ∈ Expression, where E represents the syntax domain of RA expressions.

E → σ Cond E| ∏ A1,A2,..,An E|E 1 ∪ E 2 |E 1 Cond E 2 |E 1 -E 2 |E 1 ∩ E 2 |E 1 × E 2

Semantic Algebra for the domain of relations Semantic Domain: Relation, R ; A relation is a set of all tuples such that they are all defined over the same schema or set of attributes.

Operations defined over the semantic domain R are: -

R 1 ∪ R 2 = {t|(R 1 (t) ∨ R 2 (t)) ∧ (ξ(R 1 ) ≡ ξ(R 2 ))}

A Union operation can be performed only over union compatible relations i.e. their schemas should be identical and it is denoted by the condition (ξ(R 1 ) ≡ ξ(R 2 )) where ξ(R) stands for schema of a relation R. Further explanation of the proof can be found in [14].

The Join operator's functionality isCond : R 1 × R 2 → R and description is: -R 1 Cond R 2 = {t|∀t 1 , ∀t 2 (R 1 (t 1 ) ∧ R 2 (t 2 ) → t.A 1 = t 1 .A 1 ∧ t.A 2 = t 1 .A 2 ∧ ... ∧ t.A n = t 1 .A n ∧ t.B 1 = t 2 .B 1 ∧ t.B 2 = t 2 .B 2 ∧ ... ∧ t.B n = t 2 .B n ∧ Cond(t))} 5. Difference operator's functionality is -: R × R → R and description is: -R 1 -R 2 = {t|R 1 (t) ∧ ∀t 2 (R 2 (t 2 ) → t = t 2 )} 6. Intersection operator's functionality is ∩ : R × R → R and description is: -R 1 ∩ R 2 = {t|(R 1 (t) ∧ R 2 (t)) ∧ (ξ(R 1 ) ≡ ξ(R 2 ))} 7. Cross product operator's functionality is × : R × R → R and description is: -R 1 × R 2 = {t|∀t 1 , ∀t 2 (R 1 (t 1 ) ∧ R 2 (t 2 ) → t.A 1 = t 1 .A 1 ∧ t.A 2 = t 1 .A 2 ∧ ... ∧ t.A n = t 1 .A n ∧ t.B 1 = t 2 .B 1 ∧ t.B 2 = t 2 .B 2 ∧ ... ∧ t.B n = t 2 .B n )}-[σ Cond E] r = σ Cond [E] r -[ ∏ A1,A2,..,An E] r = ∏ A1,A2,..,An [E] r -[E 1 ∪ E 2 ] r = [E 1 ] r ∪ [E 2 ] r -[E 1 Cond E 2 ] r = [E 1 ] r Cond [E 2 ] r -[E 1 -E 2 ] r = [E 1 ] r -[E 2 ] r -[E 1 ∩ E 2 ] r = [E 1 ] r ∩ [E 2 ] r -[E 1 × E 2 ] r = [E 1 ] r × [E 2 ]-Union If Ω 1 ≡ [gp 1 ], Ω 2 ≡ [gp 2 ], gp 1 = tp 1 AN D tp 2 , gp 2 = tp 1 AN D tp 3 , and R 1 , R 2 , R 3 ∈ S such that tp 1 =?s 1 R 1 ?o 1 , tp 2 =?s 1 R 2 ?o 2 , tp 3 =?s 1 R 3 ?o 3 and R x = ∏ R1.s,R1.o (R 1 R 2 ) and R y = ∏ R1.s,R1.o (R 1 R 3 ), then RA ≡ SA. RA: R x ∪ R y = {t|(R x (t) ∨ R y (t)) ∧ (ξ(R x ) ≡ ξ(R y ))} SA: ∏ ?s1,?o1 (Ω 1 ∪ Ω 2 ) = {µ |s1,o1 |µ ∈ Ω 1 ∨ µ ∈ Ω 2 } -Join If Ω 1 ≡ [tp 1 ],Ω 2 ≡ [tp 2 ], Cond ≡ expr, and R 1 , R 2 ∈ S such that tp 1 =?s 1 R 1 ?o 1 , tp 2 =?s 1 R 2 ?o 2 then RA ≡ SA3. RA: R 1 Cond R 2 = {t|∀t 1 , ∀t 2 (R 1 (t 1 ) ∧ R 2 (t 2 ) → t.A 1 = t 1 .A 1 ∧ t.A 2 = t 1 .A 2 ∧ t.B 1 = t 2 .B 1 ∧ t.B 2 = t 2 .B 2 ∧ Cond(t))} SA1: Ω 1 Ω 2 = {µ 1 ∪ µ 2 |µ 1 ∈ Ω 1 , µ 2 ∈ Ω 2 are compatible mappings} SA2: F ILT ER expr Ω = {µ|µ ∈ Ω ∧ expr(µ)} SA3: F ILT ER expr (Ω 1 Ω 2 ) Semantics of SA3 is composed of semantics of SA1 and SA2. -Difference If Ω 1 ≡ [gp 1 ], Ω 2 ≡ [gp 2 ], R ∈ S such that tp 1 =?s 1 R?o 1 , tp 2 =?s 1 R?o 2 , gp 1 = F ILT ER ?o1 Op "V alue1 (tp 1 ), gp 2 = F ILT ER ?o1 Op "V alue2 (tp 2 ), R 1 = ∏ R.s (σ R.o Op "V alue1 (R)), R 2 = ∏ R.s (σ R.o Op "V alue2 (R)),

Related work

There have been several attempts to make RDBMS and RDF stores interoperate. The most popular one, D2RQ [3], has explored one direction i.e they construct an RDF schema from Relational data. A similar effort, W3C RDB2RDF WG [9], aims to generate RDF triples from one or more Relational tables without loss of information. In [8], the author describes a transformation from SPARQL to relational algebra and attempts to make existing work on query planning and optimization available to SPARQL implementors. RETRO provides the translation in the reverse direction and proves the translation to be semantics preserving by reusing the work done on compositional semantics of SPARQL in [12] and [5].

Compositional semantics of SPARQL discussed in [12] is reused and extended by authors in [5] by providing additional operators such as F ILT ER expr and SELECT (v 1 , v 2 ..v n ). However, unlike [12] and [5], we explicitly define compositional semantics in terms of valuation functions, which map the elements of syntax domain to elements of the semantic domain and use it towards the proof of semantics preservation of the query translation.

Conclusion and Future Work

In this paper, we have provided interoperability between RDF Stores and RDBMS by firstly deriving a relational schema from the RDF store, and secondly providing a means to translate an SQL query to a semantically equivalent SPARQL query. In future versions of RETRO, we plan to provide an algorithm for deriving a user friendly relational schema, and extend the query mapping algorithm with additional and advanced SQL operators such as aggregation. Another promising direction for future work is to leverage the query optimization available for SQL queries to generate efficient, equivalent SPARQL queries.

Fig. 1 .1Fig. 1. Schema and Query Mapping

Table 1 .1The RDF dataset DS = { name(s, o), age(s, o), sal(s, o), webPage(s, o),dept(s, o), phone(s, o), email(s, o) }(B1, name, paul)(B2, name, john)(B3, name, george)(B4, name, ringo)(B1, age, 30)(B2, age, 40)(B3, age, 25)(B4, age, 35)(B1, sal, 20,000)(B2, sal, 30,000)(B3, sal, 40,000)(B4, sal, 50,000)(B2, dept, Research)(B3, dept, Research)(B3, dept, Admin)(B4, dept, Admin)(B1, phone, 777-3426)(B4, phone, 888-4537)(B2, email, john@acd.edu)SELECT ?s ?onameWHERE { ?s name ?o }=>soB1paulB2johnB3georgeB4ringo

(B4, email, ringo@acd.edu) (B3, webPage, www.george.edu) (B4, webPage, www.starr.edu)

Table 2 .2A SPARQL query and relation instance obtained from its evaluation on D for predicate "name"

Table 3 .3Illustration of SQL-to-SPARQL translation on dataset D

--> CrossProductQuery | SPJQuery | UnionQuery |IntersectQuery | ExceptQuery.CrossProductQuery --> SelectClause FromClause.SPJQuery--> SelectClause FromClause WhereClause.UnionQuery--> SqlQuery 'UNION' SqlQuery.IntersectQuery--> SqlQuery 'INTERSECT' SqlQuery.ExceptQuery--> SqlQuery 'EXCEPT' SqlQuery.SelectClause--> 'SELECT' AttributeList.FromClause--> 'FROM' TableList.WhereClause--> 'WHERE' ExpressionList.AttributeList--> Attribute',' AttributeList | Attribute.Attribute--> String.TableList--> Table','TableList | Table.Table--> String.ExpressionList--> Expression Lop ExpressionList | Expression.Expression--> Attribute Op Attribute | Attribute Op Value.Value--> Numeral | String.Op--> '<' | '>' | '=' | '>=' | '<=' | '<>'.Lop--> 'AND' | 'OR' | 'NOT' .

which implies f needs an argument from domain D 1 and one argument from D 2 ,..., and one from D n to produce an answer belonging to domain A. • Description: A description of an operation's mapping is generally given as an equational definition. However, a set, a graph, a table or a diagram can also be used. In this paper we use equational definitions. -Valuation Function: A Valuation function is a collection of functions, one for each syntax domain. They map elements of syntax domain (Abstract syntax)to elements of semantic domain (Semantic algebra), for e.g., a valuation function can be used to map a query string (syntax domain) to its result relation (semantic domain).

Semantic Algebra for the domain of mapping setsSemantic Domain: Mapping Set, Ω.

1. σ Cond operator takes a relation as an input and retains only the tuples that satisfy the expression Cond and returns the resulting relation. Cond is a boolean expression defined in the abstract syntax of SQL. Its functionality is denoted by σ Cond : R → R and description by: σ Cond R = {t|R(t) ∧ Cond(t)} where R(t) denotes tuple t belongs to relation R.2. The Project operator's functionality is -∏ A1,A2,..,An R = {t.A 1 , t.A 2 , .., t.A n |R(t)} ∏ A1,A2,..,An : R → R, description is:3. Union operator's functionality is ∪ : R × R → R and description is:

where A 1 , A 2 , .., A n are attributes of relation R 1 and B 1 , B 2 , .., B n are attributes of R 2 . The valuation function [] r , maps an element of syntax domain E to an element of semantic domain R.Valuation functions[]

r : E → R ;

Given an RDF dataset D, an SQL query, sql ∈ SqlQuery, with an equivalent relational algebra expression RA ∈ E over a relational schema S derived from D, the translation of sql to an equivalent SPARQL query, sparql(or an equivalent sparql algebra expression SA), over dataset D, is a semantics preserving translation.Proof. A query is made up of operators and operands and two given queries are equivalent if they are made up of semantically equivalent operators applied on equivalent data in the same sequence. Having defined the compositional semantics for Relational Algebra and SPARQL, proving the semantic equivalence boils down to equating their respective semantic algebras. SA: P ROJECT v1,v2,..,vn Ω = {µ |v1,v2,..,vn |µ ∈ Ω}r Theorem 1. -Selection If Ω ≡ [tp ], Cond ≡ expr and R ∈ S such that tp = ?s R ?o then RA ≡ SA. RA:σ Cond R = {t|R(t) ∧ Cond(t)} SA: F ILT ER expr Ω = {µ|µ ∈ Ω ∧ expr(µ)}The conditions Ω ≡ [tp ]and R ∈ S such that tp = ?s R ?o ensure the data equivalence and RA ≡ SA stand for operator equivalence.-Projection If Ω ≡ [tp ], {∀i, A i ≡ v i , 1 ≤ i ≤ n}and R ∈ S such that tp = ?s R ?o then RA ≡ SA. RA: ∏ A1,A2,..,An R = {t.A 1 , t.A 2 , .., t.A n |R(t)}

expr ≡ ¬bound(?o 2 ) , then RA ≡ SA3.RA: R 1 -R 2 = {t|R 1 (t) ∧ ∀t 2 (R 2 (t 2 ) ∧ t = t 2 ) ∧ (ξ(R 1 ) ≡ ξ(R 2 ))} SA1: Ω 1 Ω 2 = {Ω 1 Ω 2 ) ∪ (Ω 1 Ω 2 )} SA2: F ILT ER expr Ω = {µ|µ ∈ Ω ∧ expr(µ)} SA3: ∏ ?s1 (F ILT ER expr (Ω 1 Ω 2 )) Semantics of SA3 is composed of semantics of SA1 and SA2. -Intersection If Ω 1 ≡ [gp 1 ], Ω 2 ≡ [gp 2 ], gp 1 = tp 1 AN D tp 2 , gp 2 = tp 1 AN D tp 3 , and R 1 , R 2 , R 3 ∈ S such that tp 1 =?s 1 R 1 ?o 1 , tp 2 =?s 1 R 2 ?o 2 , tp 3 =?s 1 R 3 ?o 3 and R x = ∏ R1.s,R1.o (R 1 R 2 ) and R y = ∏ R1.s,R1.o (R 1 R 3 ), then RA ≡ SA. RA: R x ∩ R y = {t|(R x (t) ∧ R y (t)) ∧ (ξ(R x ) ≡ ξ(R y ))} SA: ∏ ?s1,?o1 (Ω 1 Ω 2 ) = {(µ 1 ∪ µ 2 ) |s1,o1 |µ 1 ∈ Ω 1 , µ 2 ∈ Ω 2 are compatible mappings} -Cross product If Ω 1 ≡ [tp 1 ],Ω 2 ≡ [tp 2 ]and R 1 , R 2 ∈ S such that tp 1 = ?s 1 R 1 ?o 1 , tp 2 =?s 2 R 2 ?o 2 then RA ≡ SA. RA: R 1 × R 2 = {t|∀t 1 , ∀t 2 (R 1 (t 1 ) ∧ R 2 (t 2 ) → t.A 1 = t 1 .A 1 ∧ t.A 2 = t 1 .A 2 ∧ t.B 1 = t 2 .B 1 ∧ t.B 2 = t 2 .B 2 )} SA: Ω 1 Ω 2 = {µ 1 ∪ µ 2 |µ 1 ∈ Ω 1 , µ 2 ∈ Ω 2 arecompatible mappings} Since all the operators in relational algebra are proved equivalent to corresponding operators in SPARQL, it can be concluded that ∀sql ∈ SqlQuery, where RA is relational algebra expression of sql, [RA] r ≡ [sparql ].

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-08-1-0260. We thank Dr. Herklotz (AFOSR) for his support.

Scalable Semantic Web Data Management Using Vertical Partitioning DJAbadi AMMadden SHollenbach KJ VLDB 2007 The Description Logic Handbook: Theory, Implementation, and Applications FBaader DCalvanese DLMcguinness DNardi PFPatel-Schneider Cambridge University Press 2003 CBizer RCyganiak D2RQ Lessons Learned Unifying Class-Based Representation Formalisms DCalvanese MLenzerini DNardi J. Artif. Intell. Res. (JAIR) 11 1999 Semantics preserving SPARQL-to-SQL translation AChebotko SLu FFotouhi Data Knowl. Eng 68 10 2009 Relational Completeness of Data Base Sublanguages EFCodd Database Systems: 65 RRustin

San Jose, California

Prentice Hall and IBM Research Report RJ 1972 987 98 Relational Completeness of Data Base Sublanguages EFCodd Database Systems: 65 RRustin

San Jose, California

Prentice Hall and IBM Research Report RJ 1972 987 98 RCyganiak A relational algebra for SPARQL 2005 AMalhotra W3C RDB2RDF Incubator Group Report 2009 Bridging the gap between OWL and relational databases BMotik IHorrocks USattler J. Web Sem 7 2 2009 Stream data analysis in Prolog DSParker 1990 MIT Press Cambridge, MA, USA Semantics and complexity of sparql JPérez MArenas CGutierrez ternational Semantic Web Conference 2006 SPARQL Query Language for RDF EPrud'hommeaux ASeaborne 2008 W3C Recommendation RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation JRachapalli VKhadilkar MKantarcioglu BThuraisingham 2011 Database Management Systems RRamakrishnan JGehrke RRamakrishnan JGehrke Aug 2002 McGraw-Hill Science/Engineering/Math 3 edn