RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation? Jyothsna Rachapalli, Vaibhav Khadilkar, Murat Kantarcioglu, and Bhavani Thuraisingham The University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080-3021, USA {jxr061100,vvk072000,muratk,bxt043000}@utdallas.edu www.utdallas.edu Abstract. There have been several attempts to make RDBMS and RDF stores inter-operate. The most popular one, D2RQ, has explored one di- rection, i.e., to look at RDBMS through RDF lenses. In this paper we present RETRO, which investigates the reverse direction, i.e., to look at RDF through Relational lenses. RETRO generates a relational schema from an RDF store, enabling a user to query RDF data using SQL. A significant advantage of this direction in-addition to interoperability is that it makes numerous relational tools developed over the past several decades, available to the RDF stores. In order to provide interoperabil- ity between these two systems one needs to resolve the heterogeneity between their respective data models and include schema mapping, data mapping and query mapping in the transformation process [5]. However, like D2RQ, RETRO chooses not to physically transform the data and deals only with schema mapping and query mapping. RETRO’s schema mapping derives a domain specific relational schema from RDF data and its query mapping transforms an SQL query over the schema into a prov- ably equivalent SPARQL query, which in-turn is executed upon the RDF store. Since RETRO is a read-only framework, its query mapping uses only a relevant and relationally complete subset of SQL. A proof of cor- rectness of this transformation is given based on compositional semantics of the two query languages. Keywords: Database Interoperability, Query Translation, RDF, RDBMS, Semantic Web, SQL, SPARQL. 1 Introduction Why not relational databases for Semantic Web? Since its advent in the 1970’s, relational databases have given rise to numerous applications and tools that cater to a wide variety of user requirements. The ? This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-08-1-0260. We thank Dr. Herklotz (AFOSR) for his support. 2 RETRO: A Semantics Preserving SQL-to-SPARQL Translation reasons for the dominance of relational databases for the past several decades are not trivial. It has continually offered the best mix of simplicity, expressivity, performance and robustness in managing generic data. RDBMS has its logical foundations in first-order predicate logic and set theory. It is basically made up of a schema and instance data. The data in a DB is operated upon by relational algebra which can be considered as an abstraction of SQL in a broad sense. The most prominent feature of RDBMS is not reasoning about schema [10], but query answering over the data. Query answering in a database is not logical reasoning, but finite model checking where the model is the given database instance and represents exactly one interpretation [2, Chapter 2]. Consequently, absence of in- formation in a database instance is interpreted as negative information. In other words, traditional semantics in RDBMS is characterized as a closed-world seman- tics. However, over time, application requirements have evolved, giving rise to a need to support reasoning or entailment in query answering. A key example of this being World Wide Web or Semantic Web. Knowledge bases in Semantic Web are based on Description logics which generally are decidable fragments of first- order logic. Unlike RDBMS, query answering in Semantic Web knowledgebases uses reasoning or logical entailment by considering all possible interpretations, which essentially is its open-world semantics. In addition to query answering, nature of data is an additional factor that leads to choice of RDF model over relational model. The data pertaining to RDBMS applications is homogeneous in nature as it is usually a complete description of a particular domain. This makes it amenable to being modeled as entities and relationships with only oc- casional problems of null values. On the other hand, data in the case of Semantic Web applications is most likely an incomplete instance dataset and such infor- mation cannot be modeled as entities and relationships without giving rise to countless null values. Factors such as incompleteness of data, non-finiteness of domain, query answering based on entailment and support for rich expressivity make RDF/OWL a better choice over relational model/RDBMS for Semantic Web applications. Why RETRO, why bridge the gap in reverse direction? Owing to the time Relational databases have been around, they have developed into a mature technology, dominating the persistence mechanism market with several well-established vendors and standards such as SQL and JDBC, which are well defined and accepted. For the same reasons, experience base of develop- ers for relational databases significantly outnumber RDF specialists. There has been a considerable growth in RDF data and a lot of valuable data now resides in RDF stores. Since a vast majority of RDBMS developers may have little fa- miliarity and limited resources to invest in Semantic Web and understand its underlying principles, tools and technologies, it becomes imperative to provide them with a means to conveniently query the existing RDF data. In-order to build such a tool, one has to step into their shoes and assess their requirements as they think relationally and know little about the expressivity supported by RDF/OWL but are indeed looking forward to use/query the RDF data with RETRO: A Semantics Preserving SQL-to-SPARQL Translation 3 minimal effort. A tool like RETRO will enable one to effortlessly reuse existing data visualization, data reporting and data mining tools available for relational data. Additionally, commercial applications that use semantic web technologies are impeded because there are much fewer Business Intelligence tools available for them when compared with tools available for relational data. RETRO will not only help commercial applications use semantic web technologies but also serve to cordially welcome RDBMS users to the brave new world of semantic web. A curious RDBMS user may eventually be instigated to leverage the expres- sivity supported by SPARQL and get motivated to learn more about it and other semantic web technologies. It may thus eventually lead to higher acceptance of semantic web (SW) and bring more work force into SW research. Integrating various databases is often a motivation for adopting RDF [3] and this in turn also serves as a motivation for RETRO, which addresses aforementioned issues of RDBMS folks such as ease of adaptability, backwards compatibility and painless migration to newer DB/Web technologies (RDF/Semantic Web) by bridging the gap between the two fundamentally different data management systems. How is the gap bridged to achieve interoperability? RETRO facilitates interoperability between RDF stores and RDBMS without physically transforming the data. This interoperability is achieved in two steps, namely schema mapping and query mapping. In schema mapping, RETRO aims to provide RDBMS users access to RDF data by creating a relational schema that directly corresponds to the RDF data. The relational schema is obtained by virtually, vertically partitioning [1] RDF data based on predicates. The ratio- nale behind this approach is described in the section on schema mapping. The resulting relational schema consists of relations of degree two, i.e., the subject and object of a predicate form the attributes of these relations. The user can now create an SQL query over this relational schema. This SQL query is then converted into a semantically equivalent SPARQL query using RETRO’s Query mapping procedure. The SPARQL query is now executed over the RDF store and the results are presented to the user in relational format. Along the following lines we state some assumptions made for query mapping/translation. A query translator is not a query processor and is therefore not responsible for verifying semantic correctness of the source query and assumes it to be correct. Given a semantically correct source query, the translator transforms it from source lan- guage to target language by preserving the semantics. Figure 1 shows the schema and query mapping components of RETRO. Problem Scope: Formal semantics of SPARQL was first discussed in an influen- tial work by Perez et. al [12] in 2006 and was followed and adapted by the W3C standard specification of SPARQL [13] in 2008. In [12], the authors present alge- braic syntax with compositional semantics of SPARQL. We reuse both of these features of SPARQL as it is the target language of our query translation. In other words, we start with an algebraic syntax and compositional semantics of relational algebra and map it to the algebraic syntax and compositional seman- 4 RETRO: A Semantics Preserving SQL-to-SPARQL Translation Fig. 1. Schema and Query Mapping tics of SPARQL respectively. Since we reuse the compositional semantics from [12], the assumptions made in it also hold in this paper, which we briefly restate here. Algebraic formalization of the core fragment of SPARQL is done over sim- ple RDF graphs without RDFS vocabulary and literal rules, and set semantics is used instead of bag semantics as implied in the SPARQL specification. Problem Definition: Given an RDF dataset and a set of relational alge- bra operators, the goal is firstly to derive a relational schema that accurately represents the RDF data and secondly, to map each of the relational algebra operators (applied to operands from relational schema) to a semantically equiv- alent SPARQL algebra operator or a set of operators. The result set obtained upon SPARQL query evaluation on the RDF store is presented to the user as a relation. Contributions: We describe a means of computing domain specific rela- tional schema from a given RDF data-store. We define a relationally complete [6] core fragment of SQL/relational algebra keeping in mind the read-only as- pect of the application. We then describe compositional semantics of this core fragment of relational algebra. Subsequently, we describe compositional seman- tics of SPARQL based on work done in [12] and [5]. Finally, we give a proof of semantics preservation of the translation based on compositional semantics of the two query languages. The aforementioned abstractions, restrictions and as- sumptions made for query translation lend to the formulation of a simple proof of correctness, by retaining the core complexity of the problem. To the best of our knowledge, this paper presents the first formal semantics preserving translation from SQL to SPARQL. RETRO: A Semantics Preserving SQL-to-SPARQL Translation 5 2 RETRO Framework: Schema Mapping ER model and Description logics both model a given domain using classes and re- lationships. Calvanese el al., in [4] define a translation function that transforms ER model into a ALUN I, DL knowledge base [4] and they prove that this transformation is information preserving. Relational databases have ER model as the most popular conceptual data model and can be mapped to a descrip- tion language that closely corresponds to the constructs of ER model. However, in Semantic Web there are many flavors of description languages with varying expressivity and computational properties that can be used to model a given domain. It is not possible to map or transform every description language to ER model as they may have different expressivity, and this is the reason why usual ER to RDF mappings cannot be leveraged to do a reverse transformation in RETRO. In our framework, we derive a domain specific relational schema from the ABox part of the DL knowledge base and do not explicitly consider the TBox. The schema mapping algorithm traverses the entire set of ABox triples and extracts every unique predicate that is stored separately for later use. The transformation or derivation of relational schema from RDF store using the schema mapping algorithm is trivially an information preserving translation. Algorithm 1 Schema-Mapping() Input: ABox Output: A map, P (relation name → rank) 1: P ← ∅ {The map P is initially empty} 2: rank ← 1 3: for i ← 1 to ABox.size do 4: T ← ABox[i] {Retrieve a triple T from the ABox} 5: p ← T.predicate {Retrieve the predicate p from the triple T } 6: if p ∈ / P then 7: P.put(p, rank) {Add p to the map P with the current value of rank} 8: rank ← rank + 1 9: end if 10: end for 11: return P The map P returned by Algorithm 1 can be used to display the relation schema to users. A relation schema is constructed by extracting each of the predicate names from P and adding attributes s and o. A relation instance, pI , in a schema S, for a predicate p, can be constructed by the evaluation of a triple pattern, tp, such that tp.s =?si ∧ tp.p ∈ P ∧ tp.o =?oi , over a triple store. A pair of values (sj , oj ) that are bound to variables (?si , ?oi ) form a tuple in pI . Algorithm 1 also ranks the predicates in the order in which they are discovered. The significance of ranking will become evident in the next section. We now give an example of a RDF dataset D, which is an extension of the dataset from [12], and its corresponding relational schema S generated by 6 RETRO: A Semantics Preserving SQL-to-SPARQL Translation Algorithm 1. We also give an example of an instance of relation “name” that can be constructed from D. When the number of predicates in a ABox is very large, we envisage providing a user with an auxiliary schema for the sole purpose of understanding the structure of data to help them with query formulation. However, the implementation of such a schema is left as a part of future work. S = { name(s, o), age(s, o), sal(s, o), webPage(s, o), dept(s, o), phone(s, o), email(s, o) } (B1 , name, paul) (B2 , name, john) (B3 , name, george) (B4 , name, ringo) (B1 , age, 30) (B2 , age, 40) (B3 , age, 25) (B4 , age, 35) (B1 , sal, 20,000) (B2 , sal, 30,000) (B3 , sal, 40,000) (B4 , sal, 50,000) (B2 , dept, Research) (B3 , dept, Research) (B3 , dept, Admin) (B4 , dept, Admin) (B1 , phone, 777-3426) (B4 , phone, 888-4537) (B2 , email, john@acd.edu) (B4 , email, ringo@acd.edu) (B3 , webPage, www.george.edu) (B4 , webPage, www.starr.edu) Table 1. The RDF dataset D SELECT ?s ?o name WHERE { ?s name ?o } => s o B1 paul B2 john B3 george B4 ringo Table 2. A SPARQL query and relation instance obtained from its evaluation on D for predicate “name” We would now like to delve into how RETRO as a tool caters to knowledge evolution and ontology dynamics. Knowledge representation is an important aspect of knowledge. As mentioned earlier, over the past several decades ap- plication requirements have evolved and commensurately the ways to represent knowledge have also evolved. However, the gap between the users of the two knowledge representation methods, namely relational and RDF represenatation, still remains. Often, an analogy is established between databases and DL knowl- edgebases by comparing the schema of a database to a TBox and a database instance to an ABox. This analogy acts as a stepping stone towards the process of bridging the gap between them. RETRO supports evolution of knowledge representation by bridging the gap between the two knowledge representation paradigms through schema mapping and query mapping. RETRO will addition- ally facilitate application of existing database technologies for e.g., data analysis RETRO: A Semantics Preserving SQL-to-SPARQL Translation 7 tools used in stream databases [11], to RDF data. This will enable one to cap- ture and analyze the dynamics of RDF data sets using the aforementioned tools, which are much more advanced than the current Semantic Web state of the art. 3 RETRO Framework: Query Mapping In this section we first illustrate the SQL-to-SPARQL translation with some examples and then describe an abstract grammar for SQL which is used to build parse trees for a given SQL query. We then describe an algorithm that translates a given SQL query into a semantically equivalent SPARQL query. In our examples we use the RDF dataset D and its associated relational schema S that were defined in the previous section. In Table 3, we give a description of an SQL query on schema S, its equivalent SPARQL query over dataset D obtained from Algorithm 2 and subsequently, we show the result of evaluating the SPARQL query on D. The SQL queries use the following relational algebra operators: Join, Union and Difference. Additional example queries for other operators can be found in our technical report [14]. 3.1 Abstract Grammar for SQL As shown in the abstract grammar below, an input SQL query for the transla- tion can be a simple Cross product query, Select-Project-Join query or queries with Union, Intersection or Difference (Except) operators. Additionally, the Query-Mapping algorithm (Algorithm 2) supports equi-join and does not con- sider conditional join because of the nature of RDF data. Conditional joins are meaningful in relational data, however, since the relational schema is derived from RDF data and joins over the derived schema are usually performed over attributes which are URI’s, conditional joins do not come into picture [14]. SqlQuery --> CrossProductQuery | SPJQuery | UnionQuery | IntersectQuery | ExceptQuery. CrossProductQuery --> SelectClause FromClause. SPJQuery --> SelectClause FromClause WhereClause. UnionQuery --> SqlQuery ’UNION’ SqlQuery. IntersectQuery --> SqlQuery ’INTERSECT’ SqlQuery. ExceptQuery --> SqlQuery ’EXCEPT’ SqlQuery. SelectClause --> ’SELECT’ AttributeList. FromClause --> ’FROM’ TableList. WhereClause --> ’WHERE’ ExpressionList. AttributeList --> Attribute’,’ AttributeList | Attribute. Attribute --> String. TableList --> Table’,’TableList | Table. Table --> String. ExpressionList --> Expression Lop ExpressionList | Expression. Expression --> Attribute Op Attribute | Attribute Op Value. Value --> Numeral | String. Op --> ’<’ | ’>’ | ’=’ | ’>=’ | ’<=’ | ’<>’. Lop --> ’AND’ | ’OR’ | ’NOT’ . 8 RETRO: A Semantics Preserving SQL-to-SPARQL Translation Description SQL Query SPARQL Query ResultSet Find the email SELECT email.o SELECT ?o2 address of john FROM name, email WHERE { ?s1 name ?o1 . ?o2 WHERE ?s1 email ?o2 . john@acd.edu name.s = email.s FILTER ( ?o1 = john ) } AND name.o = john Names of all people SELECT name.o SELECT ?o1 who either have a FROM name, email WHERE { ?o1 phone number or an WHERE { ?s1 name ?o1 . paul email address name.s = email.s ?s1 email ?o2 . } john UNION UNION ringo SELECT name.o { ?s1 name ?o1 . FROM name, phone ?s1 phone ?o3 . } WHERE } name.s = phone.s Find the resources that SELECT dept.s SELECT ?s1 work for the Research FROM dept WHERE { { department but not WHERE ?s1 dept ?o1 . ?s1 the Admin department dept.o = Research FILTER B2 EXCEPT ( ?o1 = Research ) } SELECT dept.s OPTIONAL { FROM dept ?s1 dept ?o2 . WHERE FILTER ( ?o2 = Admin ) } dept.o = Admin FILTER ( !bound(?o2) ) Table 3. Illustration of SQL-to-SPARQL translation on dataset D 3.2 Query Mapping algorithms The conceptual evaluation strategy of an SQL query first computes the cross- product of the tables in the FROM list, then deletes the rows in the cross-product that fail the qualification conditions and finally deletes all the columns that do not appear in the SELECT list. Additionally, if DISTINCT is specified, duplicate rows are eliminated. Algorithm 2 is used to construct a SPARQL query and is based on the conceptual query evaluation strategy of SQL. It takes as input an SQL query string and a map P generated from Algorithm 1, and returns an equivalent SPARQL query string. The SQL query string is first used to generate a parse tree using function parse(SQLQuery). We then extract SQL SELECT, FROM, WHERE-JC (Join Conditions of the form: Attr Op Attr) and WHERE- RETRO: A Semantics Preserving SQL-to-SPARQL Translation 9 BE (Boolean Expressions of the form: Attr Op V alue) clause strings from the parse tree. The procedure then performs a query translation based on the type of a query. Algorithm 2 makes use of several sub-procedures to construct a SPARQL query. We briefly explain each of these sub-procedures below. An interested reader can find detailed explanations of these sub-procedures in our technical report [14]. Algorithm 2 Query-Mapping() Input: An SQL query, qin , and a map P Output: A SPARQL query, qout 1: qout =“” {A SPARQL query that is initially blank} 2: tree = parse(qin ) { A parse tree obtained by parsing qin } SELECT F ROM 3: qin = tree.getSelectClause(); qin = tree.getF romClause(); W HERE−JC W HERE−BE 4: qin = tree.getJoinConditions() qin = tree.getBooleanExpr() SELECT 5: qout =“SELECT ” qout W HERE =“W HERE { ” 6: T P ← ∅ { The map of triple patterns is initially empty} 7: if tree.type = CrossProductQuery then F ROM 8: T P = TransSQLFromClause(qin ) W HERE W HERE−JC W HERE−BE 9: qout + = TransSQLWhereClause(qin , qin , T P, P ) SELECT SELECT 10: qout + = TransSQLSelectClause(qin ,TP) 11: SELECT qout = qout W HERE + qout +“ }” 12: else if tree.type = SPJQuery then F ROM 13: T P = TransSQLFromClause(qin )) W HERE W HERE−JC W HERE−BE 14: qout + = TransSQLWhereClause(qin , qin , T P, P ) W HERE 15: T P = extractT P (qout ) SELECT SELECT 16: qout + = TransSQLSelectClause(qin ,TP) 17: SELECT qout = qout W HERE + qout +“ }” 18: else if tree.type = UnionQuery then 19: q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree() 20: q1out = Query-Mapping(q1 ), q2out = Query-Mapping(q2 ) 21: qout = Merge(q1out , q2out , “UNION”) 22: else if tree.type = IntersectQuery then 23: q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree() 24: q1out = Query-Mapping(q1 ), q2out = Query-Mapping(q2 ) 25: qout = Merge(q1out , q2out , “INTERSECT”) 26: else if tree.type = ExceptQuery then 27: q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree() 28: q1out = Query-Mapping(q1 ), q2out = Query-Mapping(q2 ) 29: qout = Merge(q1out , q2out , “EXCEPT”) 30: end if 31: return qout TransSQLFromClause: This sub-procedure takes as input a set of table names from a SQL FROM clause, R, and returns a map from table names to triple patterns, T P . For each table name that is present in a FROM clause, the sub-procedure generates a triple pattern of the form, ?si ri ?oi , where ri denotes 10 RETRO: A Semantics Preserving SQL-to-SPARQL Translation a given table name. A generated triple pattern is then added to map T P indexed by its corresponding table name. TransSQLWhereClause: This sub-procedure takes sets of join conditions JC and boolean expressions BE, a map T P from TransSQLFromClause and map P generated from Algorithm 1 as inputs and returns a SPARQL WHERE clause string. In this sub-procedure, we use the aforementioned rank- ing of relations to sort the set JC to ensure that a consistent join ordering is obtained. We present the following example to show the necessity of rank- ing relations: Consider relations name, age and dept with ranks 1, 2 and 3 and a query containing only join conditions “name.s = age.s AND dept.s = age.s”. If we do not rank relations, we would get the SPARQL WHERE clause: {?s1 name ?o1 . ?s3 age ?o2 . ?s3 dept ?o3 }. However, using a sorting procedure, the second join condition would become age.s = dept.s, since the rank of dept (3) was originally > the rank of age (2). The sorted join conditions now lead to the correct WHERE clause: {?s1 name ?o1 . ?s1 age ?o2 . ?s1 dept ?o3 }. Thus, the ranking of relations used in join re-ordering ensures generation of a consistent SPARQL WHERE clause. Note that the process of ranking relations and sort- ing join conditions is merely an implementation detail and has little conceptual significance. The sub-procedure then uses the sorted join conditions to generate a part of the SPARQL WHERE clause string. This is followed by generating filter conditions using triple patterns in T P and boolean expressions present in a SQL query. TransSQLSelectClause: This sub-procedure takes as input a set of SQL SELECT attributes, A, and map T P , and returns a SPARQL SELECT clause string. It generates the variables for SPARQL SELECT by iterating over the set A, where an attribute ai ∈ A is of the form “relationName.attributeName”. During an iteration, the relation name part of an attribute ai is used to access the corresponding triple pattern from T P and the attribute name part of ai is used to access the desired variable within this triple pattern. Merge: This sub-procedure is used to merge two generated SPARQL sub- query strings for SQL queries containing Union, Intersection or Difference op- erators. The sub-procedure first extracts the necessary components of both SPARQL sub-queries (SELECT clause, triple patterns and filter expressions in WHERE clause). Next, the two sub-queries are combined into a merged query using various operations depending on the operator in the original SQL query. These operations include concatenation of triple patterns of both sub-queries, variable renaming to ensure consistency and introduction of specific keywords. 4 Semantics Preserving SQL-To-SPARQL Translation An important principle of compositional semantics is that the semantics should be compositional i.e. the meaning of a program expression should be built out of the meanings of its sub-expressions. We present compositional semantics in terms of abstract syntax definition of the language, semantic algebras and the valuation functions. RETRO: A Semantics Preserving SQL-to-SPARQL Translation 11 – Abstract Syntax: It is commonly specified as BNF grammar and it maps expressions in the language to parse trees. A Syntax domain is the term used to denote a collection of values with common syntactic structure. – Semantic Algebra: A Semantic algebra consists of a semantic domain accom- panied by a corresponding set of operations. A semantic domain is a set of elements grouped together because they share some common property, for e.g., set of natural numbers, set of tuples in a relation etc. The operations are functions that need arguments from the domain to produce answers, for e.g., selection, projection, join, are meaningful operations over the do- main of relations. An Operation is defined in terms of its functionality and a description of its mapping. • Functionality: For a function f , its functionality is an expression from domain and co-domain denoted by f : D1 × D2 × ... × Dn → A, which implies f needs an argument from domain D1 and one argument from D2 ,..., and one from Dn to produce an answer belonging to domain A. • Description: A description of an operation’s mapping is generally given as an equational definition. However, a set, a graph, a table or a diagram can also be used. In this paper we use equational definitions. – Valuation Function: A Valuation function is a collection of functions, one for each syntax domain. They map elements of syntax domain (Abstract syntax) to elements of semantic domain (Semantic algebra), for e.g., a valuation function can be used to map a query string (syntax domain) to its result relation (semantic domain). 4.1 Compositional Semantics of SPARQL In this section, we first describe the related formal notation and then give ab- stract syntax, semantic algebra of mapping sets and valuation functions as the compositional semantics for SPARQL. – RDF terms: Denoted by T , it comprises of pairwise disjoint infinite sets of I, B and L (IRI’s, Blank nodes and Literals respectively). – Triple: A triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where s is a subject, p is a predicate, and o is an object. – Triple Pattern: A triple pattern, tp, is a triple of the form (sp, pp, op) ∈ (I ∪ V ∪ L) × (I ∪ V ) × (I ∪ V ∪ L), where V is an infinite set of variables that is pairwise disjoint from the sets I, B, and L; and sp, pp, and op are a subject pattern, predicate pattern, and object pattern respectively. – A mapping µ is a partial function µ : V → T . Given a triple pattern t, µ(t) denotes the triple obtained by replacing the variables in t according to µ. Domain of µ, dom(µ) is the subset of V where µ is defined and Ω is a set of mappings µ. Two mappings, µ1 and µ2 are compatible when for all x ∈ dom(µ1 ) ∩ dom(µ2 ), it is the case that µ1 (x) = µ2 (x). Abstract Syntax for SPARQL The abstract syntax shown below describes the structure of graph pattern ex- pressions. 12 RETRO: A Semantics Preserving SQL-to-SPARQL Translation gp ∈ graphpatterns, where gp represents the syntax domain of graph patterns gp ::= tp | gp F ILT ER expr | P ROJECTv1 ,v2 ,..,vn gp | gp U N ION gp | gp AN D gp | gp OP T gp Semantic Algebra for the domain of mapping sets Semantic Domain: Mapping Set, Ω. Operations defined over the semantic domain Ω are: 1. The σexpr operator takes a mapping set as an input and retains only the mappings that satisfy the expression expr and returns the resulting mapping set. Its functionality is denoted by σexpr : Ω → Ω and description by: – σ∏expr Ω = {µ|µ ∈ Ω ∧ µ  expr} 2. The vars operator takes an input mapping set and retains only the variables belonging to the set vars in each ∏ mapping and returns the resulting mapping Its functionality is given by v1 ,v2 ,..,vn : Ω → Ω and description by: set. ∏ – v1 ,v2 ,..,vn Ω = {µ|v1 ,v2 ,..,vn |µ ∈ Ω} 3. The Union operator ∪, takes as input two mapping sets and computes the set union and returns the resulting mapping set. Its functionality is given by ∪ : Ω × Ω → Ω and description by: – Ω1 ∪ Ω2 = {µ|µ ∈ Ω1 ∨ µ ∈ Ω2 } 4. The Join operator’s functionality is given by o n: Ω × Ω → Ω, description by: – Ω1 o n Ω2 = {µ1 ∪ µ2 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings} 5. Difference operator’s functionality is r : Ω × Ω → Ω, description is: – Ω1 r Ω2 = {µ ∈ Ω1 | for all µ0 ∈ Ω2 , µ and µ0 are not compatible } 6. The Optional operator’s functionality is : Ω × Ω → Ω, description is: – Ω1 Ω2 = {(Ω1 o n Ω2 ) ∪ (Ω1 r Ω2 )} Valuation functions []: gp → Ω ; The valuation function [], maps an element of gp to Ω, i.e, it maps the language’s abstract syntax structures to meanings drawn from semantic domains. The domain of a valuation function is a set of derivation trees of the language and is defined structurally. It determines meanings of its subtrees and combines them into meaning of the entire tree. – [tp ]= {µ|dom(µ) = var(tp) ∧ µ(tp) ∈ G} – [gp F ILT ER expr]= σexpr∏[gp ] – [P ROJECTv1 ,v2 ,..,vn gp]= v1 ,v2 ,..,vn [gp ] – [gp1 U N ION gp2 ]= [gp1 ]∪ [gp2 ] – [gp1 AN D gp2 ]= [gp1 ]o n [gp2 ] – [gp1 OP T gp2 ]= [gp1 ] [gp2 ] The meaning of expression [gp1 U N ION gp2 ]is computed by first evaluating [gp1 ]to get Ω1 and [gp2 ]to get Ω2 and then computing Ω1 ∪Ω2 using the mapping set’s semantic algebra operation union, defined earlier. Given a mapping µ and expression expr, µ satisfies expr (µ  expr), iff (i) expr is bound(?X) and ?X ∈ dom(µ); (ii) expr is ?X op c, ?X ∈ dom(µ) and µ(?X) op c; (iii) expr is ?X op ?Y , ?X, ?Y ∈ dom(µ), and µ(?X) op µ(?Y ), where op ∈ {< | ≤ | ≥ |>| =}; (iv) expr is (¬expr1 ) and it is not the case that µ  expr1 ; (v) expr is (expr1 ∧ expr2 ) and µ  expr1 and µ  expr2 ; (vi) expr is (expr1 ∨ expr2 ) and µ  expr1 or µ  expr2 ; RETRO: A Semantics Preserving SQL-to-SPARQL Translation 13 4.2 Compositional Semantics of Relational Algebra The close relationship between SQL and Relational Algebra is the basis for query optimization in a RDBMS [15], and in our work it forms the basis for query trans- lation. In order to simplify the proof, we use relational algebra based syntax and define its compositional semantics by using tuple relational calculus formulas since they closely mirror the mapping based semantics of SPARQL. The proof of equivalence between relational algebra and relational calculus presented in [7] enables us to define the compositional semantics in this manner. Abstract Syntax for Relational Algebra The abstract syntax describes the structure of relational algebra (RA) expressions. E ∈ Expression, ∏ where E represents the syntax domain of RA expressions. E → σCond E| A1 ,A2 ,..,An E|E1 ∪ E2 |E1 o nCond E2 |E1 –E2 |E1 ∩ E2 |E1 × E2 Semantic Algebra for the domain of relations Semantic Domain: Relation, R ; A relation is a set of all tuples such that they are all defined over the same schema or set of attributes. Operations defined over the semantic domain R are: 1. σCond operator takes a relation as an input and retains only the tuples that satisfy the expression Cond and returns the resulting relation. Cond is a boolean expression defined in the abstract syntax of SQL. Its functionality is denoted by σCond : R → R and description by: – σCond R = {t|R(t) ∧ Cond(t)} where R(t) denotes tuple t belongs to relation R. ∏ 2. The∏ Project operator’s functionality is A1 ,A2 ,..,An : R → R, description is: – A1 ,A2 ,..,An R = {t.A1 , t.A2 , .., t.An |R(t)} 3. Union operator’s functionality is ∪ : R × R → R and description is: – R1 ∪ R2 = {t|(R1 (t) ∨ R2 (t)) ∧ (ξ(R1 ) ≡ ξ(R2 ))} A Union operation can be performed only over union compatible re- lations i.e. their schemas should be identical and it is denoted by the condition (ξ(R1 ) ≡ ξ(R2 )) where ξ(R) stands for schema of a relation R. 4. The Join operator’s functionality is o nCond : R1 × R2 → R and description is: – R1 o nCond R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 = t1 .A2 ∧ ... ∧ t.An = t1 .An ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 ∧ ... ∧ t.Bn = t2 .Bn ∧ Cond(t))} 5. Difference operator’s functionality is – : R × R → R and description is: – R1 –R2 = {t|R1 (t) ∧ ∀t2 (R2 (t2 ) → t 6= t2 )} 6. Intersection operator’s functionality is ∩ : R × R → R and description is: – R1 ∩ R2 = {t|(R1 (t) ∧ R2 (t)) ∧ (ξ(R1 ) ≡ ξ(R2 ))} 7. Cross product operator’s functionality is × : R × R → R and description is: – R1 × R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 = t1 .A2 ∧ ... ∧ t.An = t1 .An ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 ∧ ... ∧ t.Bn = t2 .Bn )} where A1 , A2 , .., An are attributes of relation R1 and B1 , B2 , .., Bn are attributes of R2 . 14 RETRO: A Semantics Preserving SQL-to-SPARQL Translation Valuation functions []r : E → R ; The valuation function []r , maps an element of syntax domain E to an element of semantic domain R. – [σ ∏Cond E ]r = σCond [∏ E ]r – [ A1 ,A2 ,..,An E ]r = A1 ,A2 ,..,An [E ]r – [E1 ∪ E2 ]r = [E1 ]r ∪ [E2 ]r – [E 1 o nCond E2 ]r = [E1 ]r o nCond [E2 ]r – [E1 –E2 ]r = [E1 ]r – [E2 ]r – [E1 ∩ E2 ]r = [E1 ]r ∩ [E2 ]r – [E 1 × E 2 ]r = [E 1 ]r × [E 2 ]r Theorem 1. Given an RDF dataset D, an SQL query, sql ∈ SqlQuery, with an equivalent relational algebra expression RA ∈ E over a relational schema S derived from D, the translation of sql to an equivalent SPARQL query, sparql(or an equivalent sparql algebra expression SA), over dataset D, is a semantics pre- serving translation. Proof. A query is made up of operators and operands and two given queries are equivalent if they are made up of semantically equivalent operators applied on equivalent data in the same sequence. Having defined the compositional seman- tics for Relational Algebra and SPARQL, proving the semantic equivalence boils down to equating their respective semantic algebras. – Selection If Ω ≡ [tp ], Cond ≡ expr and R ∈ S such that tp = ?s R ?o then RA ≡ SA. RA: σCond R = {t|R(t) ∧ Cond(t)} SA: F ILT ERexpr Ω = {µ|µ ∈ Ω ∧ expr(µ)} The conditions Ω ≡ [tp ]and R ∈ S such that tp = ?s R ?o ensure the data equivalence and RA ≡ SA stand for operator equivalence. – Projection If Ω ≡ [tp ], {∀i, Ai ≡ vi , 1 ≤ i ≤ n} and R ∈ S such that tp = ?s R ∏?o then RA ≡ SA. RA: A1 ,A2 ,..,An R = {t.A1 , t.A2 , .., t.An |R(t)} SA: P ROJECTv1 ,v2 ,..,vn Ω = {µ|v1 ,v2 ,..,vn |µ ∈ Ω} – Union If Ω1 ≡ [gp1 ], Ω2 ≡ [gp2 ], gp1 = tp1 AN D tp2 , gp2 = tp1 AN D tp3 , ∏3 ∈ S such that tp1 =?s1 R1 ?o1 , tp2∏=?s1 R2 ?o2 , tp3 =?s1 R3 ?o3 and R1 , R2 , R and Rx = R1 .s,R1 .o (R1 o n R2 ) and Ry = R1 .s,R1 .o (R1 o n R3 ), then RA ≡ SA. RA: ∏ Rx ∪ Ry = {t|(Rx (t) ∨ Ry (t)) ∧ (ξ(Rx ) ≡ ξ(Ry ))} SA: ?s1 ,?o1 (Ω1 ∪ Ω2 ) = {µ|s1,o1 |µ ∈ Ω1 ∨ µ ∈ Ω2 } – Join If Ω1 ≡ [tp1 ],Ω2 ≡ [tp2 ], Cond ≡ expr, and R1 , R2 ∈ S such that tp1 =?s1 R1 ?o1 , tp2 =?s1 R2 ?o2 then RA ≡ SA3. RA: R1 o nCond R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 = t1 .A2 ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 ∧ Cond(t))} SA1: Ω1 o n Ω2 = {µ1 ∪ µ2 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings} SA2: F ILT ERexpr Ω = {µ|µ ∈ Ω ∧ expr(µ)} SA3: F ILT ERexpr (Ω1 o n Ω2 ) Semantics of SA3 is composed of semantics of SA1 and SA2. RETRO: A Semantics Preserving SQL-to-SPARQL Translation 15 – Difference If Ω1 ≡ [gp1 ], Ω2 ≡ [gp2 ], R ∈ S such that tp1 =?s1 R?o1 , tp2 =?s∏1 R?o2 , gp1 = F ILT ER?o1 Op “V alue1 ∏ 00 (tp1 ), gp2 = F ILT ER?o1 Op “V alue200 (tp2 ), R1 = R.s (σR.o Op “V alue100 (R)), R2 = R.s (σR.o Op “V alue200 (R)), expr ≡ ¬bound(?o2 ) , then RA ≡ SA3. RA: R1 –R2 = {t|R1 (t) ∧ ∀t2 (R2 (t2 ) ∧ t 6= t2 ) ∧ (ξ(R1 ) ≡ ξ(R2 ))} SA1: Ω1 Ω2 = {Ω1 o n Ω2 ) ∪ (Ω1 r Ω2 )} SA2: F ∏ ILT ER expr Ω = {µ|µ ∈ Ω ∧ expr(µ)} SA3: ?s1 (F ILT ERexpr (Ω1 Ω2 )) Semantics of SA3 is composed of semantics of SA1 and SA2. – Intersection If Ω1 ≡ [gp1 ], Ω2 ≡ [gp2 ], gp1 = tp1 AN D tp2 , gp2 = tp1 AN D tp3 , and R1 , R2 , ∏R3 ∈ S such that tp1 =?s1 R1 ?o1 ,∏tp2 =?s1 R2 ?o2 , tp3 =?s1 R3 ?o3 and Rx = R1 .s,R1 .o (R1 o n R2 ) and Ry = R1 .s,R1 .o (R1 o n R3 ), then RA ≡ SA. RA: ∏ Rx ∩ Ry = {t|(Rx (t) ∧ Ry (t)) ∧ (ξ(Rx ) ≡ ξ(Ry ))} SA: ?s1 ,?o1 (Ω1 o n Ω2 ) = {(µ1 ∪ µ2 )|s1,o1 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings} – Cross product If Ω1 ≡ [tp1 ],Ω2 ≡ [tp2 ]and R1 , R2 ∈ S such that tp1 = ?s1 R1 ?o1 , tp2 =?s2 R2 ?o2 then RA ≡ SA. RA: R1 × R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 = t1 .A2 ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 )} SA: Ω1 o n Ω2 = {µ1 ∪ µ2 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings} Since all the operators in relational algebra are proved equivalent to correspond- ing operators in SPARQL, it can be concluded that ∀sql ∈ SqlQuery, where RA is relational algebra expression of sql, [RA]r ≡ [sparql ]. Further explanation of the proof can be found in [14]. 5 Related work There have been several attempts to make RDBMS and RDF stores inter- operate. The most popular one, D2RQ [3], has explored one direction i.e they construct an RDF schema from Relational data. A similar effort, W3C RDB2RDF WG [9], aims to generate RDF triples from one or more Relational tables without loss of information. In [8], the author describes a transformation from SPARQL to relational algebra and attempts to make existing work on query planning and optimization available to SPARQL implementors. RETRO provides the transla- tion in the reverse direction and proves the translation to be semantics preserving by reusing the work done on compositional semantics of SPARQL in [12] and [5]. Compositional semantics of SPARQL discussed in [12] is reused and extended by authors in [5] by providing additional operators such as F ILT ERexpr and SELECT (v1 , v2 ..vn ). However, unlike [12] and [5], we explicitly define compo- sitional semantics in terms of valuation functions, which map the elements of syntax domain to elements of the semantic domain and use it towards the proof of semantics preservation of the query translation. 16 RETRO: A Semantics Preserving SQL-to-SPARQL Translation 6 Conclusion and Future Work In this paper, we have provided interoperability between RDF Stores and RDBMS by firstly deriving a relational schema from the RDF store, and secondly provid- ing a means to translate an SQL query to a semantically equivalent SPARQL query. In future versions of RETRO, we plan to provide an algorithm for deriving a user friendly relational schema, and extend the query mapping algorithm with additional and advanced SQL operators such as aggregation. Another promising direction for future work is to leverage the query optimization available for SQL queries to generate efficient, equivalent SPARQL queries. References 1. Abadi, D.J., 0002, A.M., Madden, S., Hollenbach, K.J.: Scalable Semantic Web Data Management Using Vertical Partitioning. In: VLDB. pp. 411–422 (2007) 2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applica- tions. Cambridge University Press (2003) 3. Bizer, C., Cyganiak, R.: D2RQ Lessons Learned. http://www.w3.org/2007/03/ RdfRDB/papers/d2rq-positionpaper/ 4. Calvanese, D., Lenzerini, M., Nardi, D.: Unifying Class-Based Representation For- malisms. J. Artif. Intell. Res. (JAIR) 11, 199–240 (1999) 5. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL transla- tion. Data Knowl. Eng. 68(10), 973–1000 (2009) 6. Codd, E.F.: Relational Completeness of Data Base Sublanguages. In: R. Rustin (ed.): Database Systems: 65-98, Prentice Hall and IBM Research Report RJ 987, San Jose, California (1972) 7. Codd, E.F.: Relational Completeness of Data Base Sublanguages. In: R. Rustin (ed.): Database Systems: 65-98, Prentice Hall and IBM Research Report RJ 987, San Jose, California (1972) 8. Cyganiak, R.: A relational algebra for SPARQL (2005), http://www.hpl.hp.com/ techreports/2005/HPL-2005-170.html 9. Malhotra, A.: W3C RDB2RDF Incubator Group Report. http://www.w3.org/ 2005/Incubator/rdb2rdf/XGR-rdb2rdf-20090126/ (2009) 10. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases. J. Web Sem. 7(2), 74–89 (2009) 11. Parker, D.S.: Stream data analysis in Prolog, pp. 249–312. MIT Press, Cambridge, MA, USA (1990), http://dl.acm.org/citation.cfm?id=103477.103548 12. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of sparql. In: In- ternational Semantic Web Conference. pp. 30–43 (2006) 13. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Rec- ommendation (2008) 14. Rachapalli, J., Khadilkar, V., Kantarcioglu, M., Thuraisingham, B.: RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Transla- tion. http://www.utdallas.edu/~vvk072000/Research/Technical-Report/ technical-report.html (2011) 15. Ramakrishnan, R., Gehrke, J., Ramakrishnan, R., Gehrke, J.: Database Man- agement Systems. McGraw-Hill Science/Engineering/Math, 3 edn. (Aug 2002), http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20\&path= ASIN/0072465638