=Paper=
{{Paper
|id=None
|storemode=property
|title=RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation
|pdfUrl=https://ceur-ws.org/Vol-784/evodyn3.pdf
|volume=Vol-784
}}
==RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Translation==
<pdf width="1500px">https://ceur-ws.org/Vol-784/evodyn3.pdf</pdf>
<pre>
RETRO: A Framework for Semantics Preserving
       SQL-to-SPARQL Translation?

        Jyothsna Rachapalli, Vaibhav Khadilkar, Murat Kantarcioglu, and
                             Bhavani Thuraisingham

                        The University of Texas at Dallas,
             800 West Campbell Road, Richardson, TX 75080-3021, USA
             {jxr061100,vvk072000,muratk,bxt043000}@utdallas.edu
                               www.utdallas.edu


        Abstract. There have been several attempts to make RDBMS and RDF
        stores inter-operate. The most popular one, D2RQ, has explored one di-
        rection, i.e., to look at RDBMS through RDF lenses. In this paper we
        present RETRO, which investigates the reverse direction, i.e., to look at
        RDF through Relational lenses. RETRO generates a relational schema
        from an RDF store, enabling a user to query RDF data using SQL. A
        signiﬁcant advantage of this direction in-addition to interoperability is
        that it makes numerous relational tools developed over the past several
        decades, available to the RDF stores. In order to provide interoperabil-
        ity between these two systems one needs to resolve the heterogeneity
        between their respective data models and include schema mapping, data
        mapping and query mapping in the transformation process [5]. However,
        like D2RQ, RETRO chooses not to physically transform the data and
        deals only with schema mapping and query mapping. RETRO’s schema
        mapping derives a domain speciﬁc relational schema from RDF data and
        its query mapping transforms an SQL query over the schema into a prov-
        ably equivalent SPARQL query, which in-turn is executed upon the RDF
        store. Since RETRO is a read-only framework, its query mapping uses
        only a relevant and relationally complete subset of SQL. A proof of cor-
        rectness of this transformation is given based on compositional semantics
        of the two query languages.

        Keywords: Database Interoperability, Query Translation, RDF, RDBMS,
        Semantic Web, SQL, SPARQL.


1     Introduction

Why not relational databases for Semantic Web?
Since its advent in the 1970’s, relational databases have given rise to numerous
applications and tools that cater to a wide variety of user requirements. The
?
    This material is based upon work supported by the Air Force Oﬃce of Scientiﬁc
    Research under award number FA9550-08-1-0260. We thank Dr. Herklotz (AFOSR)
    for his support.
2       RETRO: A Semantics Preserving SQL-to-SPARQL Translation

reasons for the dominance of relational databases for the past several decades
are not trivial. It has continually oﬀered the best mix of simplicity, expressivity,
performance and robustness in managing generic data. RDBMS has its logical
foundations in ﬁrst-order predicate logic and set theory. It is basically made up
of a schema and instance data. The data in a DB is operated upon by relational
algebra which can be considered as an abstraction of SQL in a broad sense. The
most prominent feature of RDBMS is not reasoning about schema [10], but query
answering over the data. Query answering in a database is not logical reasoning,
but ﬁnite model checking where the model is the given database instance and
represents exactly one interpretation [2, Chapter 2]. Consequently, absence of in-
formation in a database instance is interpreted as negative information. In other
words, traditional semantics in RDBMS is characterized as a closed-world seman-
tics. However, over time, application requirements have evolved, giving rise to a
need to support reasoning or entailment in query answering. A key example of
this being World Wide Web or Semantic Web. Knowledge bases in Semantic Web
are based on Description logics which generally are decidable fragments of ﬁrst-
order logic. Unlike RDBMS, query answering in Semantic Web knowledgebases
uses reasoning or logical entailment by considering all possible interpretations,
which essentially is its open-world semantics. In addition to query answering,
nature of data is an additional factor that leads to choice of RDF model over
relational model. The data pertaining to RDBMS applications is homogeneous
in nature as it is usually a complete description of a particular domain. This
makes it amenable to being modeled as entities and relationships with only oc-
casional problems of null values. On the other hand, data in the case of Semantic
Web applications is most likely an incomplete instance dataset and such infor-
mation cannot be modeled as entities and relationships without giving rise to
countless null values. Factors such as incompleteness of data, non-ﬁniteness of
domain, query answering based on entailment and support for rich expressivity
make RDF/OWL a better choice over relational model/RDBMS for Semantic
Web applications.

Why RETRO, why bridge the gap in reverse direction?
Owing to the time Relational databases have been around, they have developed
into a mature technology, dominating the persistence mechanism market with
several well-established vendors and standards such as SQL and JDBC, which
are well deﬁned and accepted. For the same reasons, experience base of develop-
ers for relational databases signiﬁcantly outnumber RDF specialists. There has
been a considerable growth in RDF data and a lot of valuable data now resides
in RDF stores. Since a vast majority of RDBMS developers may have little fa-
miliarity and limited resources to invest in Semantic Web and understand its
underlying principles, tools and technologies, it becomes imperative to provide
them with a means to conveniently query the existing RDF data. In-order to
build such a tool, one has to step into their shoes and assess their requirements
as they think relationally and know little about the expressivity supported by
RDF/OWL but are indeed looking forward to use/query the RDF data with
             RETRO: A Semantics Preserving SQL-to-SPARQL Translation             3

minimal eﬀort. A tool like RETRO will enable one to eﬀortlessly reuse existing
data visualization, data reporting and data mining tools available for relational
data. Additionally, commercial applications that use semantic web technologies
are impeded because there are much fewer Business Intelligence tools available
for them when compared with tools available for relational data. RETRO will
not only help commercial applications use semantic web technologies but also
serve to cordially welcome RDBMS users to the brave new world of semantic
web. A curious RDBMS user may eventually be instigated to leverage the expres-
sivity supported by SPARQL and get motivated to learn more about it and other
semantic web technologies. It may thus eventually lead to higher acceptance of
semantic web (SW) and bring more work force into SW research. Integrating
various databases is often a motivation for adopting RDF [3] and this in turn
also serves as a motivation for RETRO, which addresses aforementioned issues of
RDBMS folks such as ease of adaptability, backwards compatibility and painless
migration to newer DB/Web technologies (RDF/Semantic Web) by bridging the
gap between the two fundamentally diﬀerent data management systems.

How is the gap bridged to achieve interoperability?
RETRO facilitates interoperability between RDF stores and RDBMS without
physically transforming the data. This interoperability is achieved in two steps,
namely schema mapping and query mapping. In schema mapping, RETRO aims
to provide RDBMS users access to RDF data by creating a relational schema
that directly corresponds to the RDF data. The relational schema is obtained
by virtually, vertically partitioning [1] RDF data based on predicates. The ratio-
nale behind this approach is described in the section on schema mapping. The
resulting relational schema consists of relations of degree two, i.e., the subject
and object of a predicate form the attributes of these relations. The user can
now create an SQL query over this relational schema. This SQL query is then
converted into a semantically equivalent SPARQL query using RETRO’s Query
mapping procedure. The SPARQL query is now executed over the RDF store
and the results are presented to the user in relational format. Along the following
lines we state some assumptions made for query mapping/translation. A query
translator is not a query processor and is therefore not responsible for verifying
semantic correctness of the source query and assumes it to be correct. Given a
semantically correct source query, the translator transforms it from source lan-
guage to target language by preserving the semantics. Figure 1 shows the schema
and query mapping components of RETRO.

Problem Scope: Formal semantics of SPARQL was ﬁrst discussed in an inﬂuen-
tial work by Perez et. al [12] in 2006 and was followed and adapted by the W3C
standard speciﬁcation of SPARQL [13] in 2008. In [12], the authors present alge-
braic syntax with compositional semantics of SPARQL. We reuse both of these
features of SPARQL as it is the target language of our query translation. In
other words, we start with an algebraic syntax and compositional semantics of
relational algebra and map it to the algebraic syntax and compositional seman-
4       RETRO: A Semantics Preserving SQL-to-SPARQL Translation


                       Fig. 1. Schema and Query Mapping


tics of SPARQL respectively. Since we reuse the compositional semantics from
[12], the assumptions made in it also hold in this paper, which we brieﬂy restate
here. Algebraic formalization of the core fragment of SPARQL is done over sim-
ple RDF graphs without RDFS vocabulary and literal rules, and set semantics
is used instead of bag semantics as implied in the SPARQL speciﬁcation.
    Problem Definition: Given an RDF dataset and a set of relational alge-
bra operators, the goal is ﬁrstly to derive a relational schema that accurately
represents the RDF data and secondly, to map each of the relational algebra
operators (applied to operands from relational schema) to a semantically equiv-
alent SPARQL algebra operator or a set of operators. The result set obtained
upon SPARQL query evaluation on the RDF store is presented to the user as a
relation.
    Contributions: We describe a means of computing domain speciﬁc rela-
tional schema from a given RDF data-store. We deﬁne a relationally complete
[6] core fragment of SQL/relational algebra keeping in mind the read-only as-
pect of the application. We then describe compositional semantics of this core
fragment of relational algebra. Subsequently, we describe compositional seman-
tics of SPARQL based on work done in [12] and [5]. Finally, we give a proof of
semantics preservation of the translation based on compositional semantics of
the two query languages. The aforementioned abstractions, restrictions and as-
sumptions made for query translation lend to the formulation of a simple proof of
correctness, by retaining the core complexity of the problem. To the best of our
knowledge, this paper presents the ﬁrst formal semantics preserving translation
from SQL to SPARQL.
              RETRO: A Semantics Preserving SQL-to-SPARQL Translation                5

2    RETRO Framework: Schema Mapping
ER model and Description logics both model a given domain using classes and re-
lationships. Calvanese el al., in [4] deﬁne a translation function that transforms
ER model into a ALUN I, DL knowledge base [4] and they prove that this
transformation is information preserving. Relational databases have ER model
as the most popular conceptual data model and can be mapped to a descrip-
tion language that closely corresponds to the constructs of ER model. However,
in Semantic Web there are many ﬂavors of description languages with varying
expressivity and computational properties that can be used to model a given
domain. It is not possible to map or transform every description language to
ER model as they may have diﬀerent expressivity, and this is the reason why
usual ER to RDF mappings cannot be leveraged to do a reverse transformation
in RETRO. In our framework, we derive a domain speciﬁc relational schema
from the ABox part of the DL knowledge base and do not explicitly consider the
TBox. The schema mapping algorithm traverses the entire set of ABox triples
and extracts every unique predicate that is stored separately for later use. The
transformation or derivation of relational schema from RDF store using the
schema mapping algorithm is trivially an information preserving translation.


Algorithm 1 Schema-Mapping()
Input: ABox
Output: A map, P (relation name → rank)
 1: P ← ∅ {The map P is initially empty}
 2: rank ← 1
 3: for i ← 1 to ABox.size do
 4:   T ← ABox[i] {Retrieve a triple T from the ABox}
 5:   p ← T.predicate {Retrieve the predicate p from the triple T }
 6:   if p ∈
           / P then
 7:      P.put(p, rank) {Add p to the map P with the current value of rank}
 8:      rank ← rank + 1
 9:   end if
10: end for
11: return P


    The map P returned by Algorithm 1 can be used to display the relation
schema to users. A relation schema is constructed by extracting each of the
predicate names from P and adding attributes s and o. A relation instance, pI ,
in a schema S, for a predicate p, can be constructed by the evaluation of a triple
pattern, tp, such that tp.s =?si ∧ tp.p ∈ P ∧ tp.o =?oi , over a triple store. A
pair of values (sj , oj ) that are bound to variables (?si , ?oi ) form a tuple in pI .
Algorithm 1 also ranks the predicates in the order in which they are discovered.
The signiﬁcance of ranking will become evident in the next section.
    We now give an example of a RDF dataset D, which is an extension of
the dataset from [12], and its corresponding relational schema S generated by
6       RETRO: A Semantics Preserving SQL-to-SPARQL Translation

Algorithm 1. We also give an example of an instance of relation “name” that
can be constructed from D. When the number of predicates in a ABox is very
large, we envisage providing a user with an auxiliary schema for the sole purpose
of understanding the structure of data to help them with query formulation.
However, the implementation of such a schema is left as a part of future work.
S = { name(s, o), age(s, o), sal(s, o), webPage(s, o),
      dept(s, o), phone(s, o), email(s, o) }


(B1 , name, paul)             (B2 , name, john)           (B3 , name, george)
(B4 , name, ringo)            (B1 , age, 30)              (B2 , age, 40)
(B3 , age, 25)                (B4 , age, 35)              (B1 , sal, 20,000)
(B2 , sal, 30,000)            (B3 , sal, 40,000)          (B4 , sal, 50,000)
(B2 , dept, Research)         (B3 , dept, Research)       (B3 , dept, Admin)
(B4 , dept, Admin)            (B1 , phone, 777-3426)      (B4 , phone, 888-4537)
(B2 , email, john@acd.edu)    (B4 , email, ringo@acd.edu) (B3 , webPage, www.george.edu)
(B4 , webPage, www.starr.edu)
                           Table 1. The RDF dataset D


SELECT ?s ?o                                  name
WHERE { ?s name ?o }               =>         s      o
                                              B1     paul
                                              B2     john
                                              B3     george
                                              B4     ringo
Table 2. A SPARQL query and relation instance obtained from its evaluation on D
for predicate “name”


     We would now like to delve into how RETRO as a tool caters to knowledge
evolution and ontology dynamics. Knowledge representation is an important
aspect of knowledge. As mentioned earlier, over the past several decades ap-
plication requirements have evolved and commensurately the ways to represent
knowledge have also evolved. However, the gap between the users of the two
knowledge representation methods, namely relational and RDF represenatation,
still remains. Often, an analogy is established between databases and DL knowl-
edgebases by comparing the schema of a database to a TBox and a database
instance to an ABox. This analogy acts as a stepping stone towards the process
of bridging the gap between them. RETRO supports evolution of knowledge
representation by bridging the gap between the two knowledge representation
paradigms through schema mapping and query mapping. RETRO will addition-
ally facilitate application of existing database technologies for e.g., data analysis
             RETRO: A Semantics Preserving SQL-to-SPARQL Translation           7

tools used in stream databases [11], to RDF data. This will enable one to cap-
ture and analyze the dynamics of RDF data sets using the aforementioned tools,
which are much more advanced than the current Semantic Web state of the art.

3     RETRO Framework: Query Mapping
In this section we ﬁrst illustrate the SQL-to-SPARQL translation with some
examples and then describe an abstract grammar for SQL which is used to
build parse trees for a given SQL query. We then describe an algorithm that
translates a given SQL query into a semantically equivalent SPARQL query. In
our examples we use the RDF dataset D and its associated relational schema S
that were deﬁned in the previous section.
    In Table 3, we give a description of an SQL query on schema S, its equivalent
SPARQL query over dataset D obtained from Algorithm 2 and subsequently, we
show the result of evaluating the SPARQL query on D. The SQL queries use the
following relational algebra operators: Join, Union and Diﬀerence. Additional
example queries for other operators can be found in our technical report [14].

3.1   Abstract Grammar for SQL
As shown in the abstract grammar below, an input SQL query for the transla-
tion can be a simple Cross product query, Select-Project-Join query or queries
with Union, Intersection or Diﬀerence (Except) operators. Additionally, the
Query-Mapping algorithm (Algorithm 2) supports equi-join and does not con-
sider conditional join because of the nature of RDF data. Conditional joins are
meaningful in relational data, however, since the relational schema is derived
from RDF data and joins over the derived schema are usually performed over
attributes which are URI’s, conditional joins do not come into picture [14].
SqlQuery          --> CrossProductQuery | SPJQuery | UnionQuery |
                      IntersectQuery | ExceptQuery.
CrossProductQuery --> SelectClause FromClause.
SPJQuery          --> SelectClause FromClause WhereClause.
UnionQuery        --> SqlQuery ’UNION’ SqlQuery.
IntersectQuery    --> SqlQuery ’INTERSECT’ SqlQuery.
ExceptQuery       --> SqlQuery ’EXCEPT’ SqlQuery.
SelectClause      --> ’SELECT’ AttributeList.
FromClause        --> ’FROM’ TableList.
WhereClause       --> ’WHERE’ ExpressionList.
AttributeList     --> Attribute’,’ AttributeList | Attribute.
Attribute         --> String.
TableList         --> Table’,’TableList | Table.
Table             --> String.
ExpressionList    --> Expression Lop ExpressionList | Expression.
Expression        --> Attribute Op Attribute | Attribute Op Value.
Value             --> Numeral | String.
Op                --> ’<’ | ’>’ | ’=’ | ’>=’ | ’<=’ | ’<>’.
Lop               --> ’AND’ | ’OR’ | ’NOT’ .
8       RETRO: A Semantics Preserving SQL-to-SPARQL Translation


Description             SQL Query            SPARQL Query               ResultSet
Find the email          SELECT email.o       SELECT ?o2
address of john         FROM name, email WHERE { ?s1 name ?o1 .           ?o2
                        WHERE                      ?s1 email ?o2 .        john@acd.edu
                        name.s = email.s     FILTER ( ?o1 = john ) }
                        AND name.o = john
Names of all people     SELECT name.o        SELECT ?o1
who either have a       FROM name, email WHERE {                          ?o1
phone number or an      WHERE                    { ?s1 name ?o1 .         paul
email address             name.s = email.s       ?s1 email ?o2 . }        john
                        UNION                    UNION                    ringo
                        SELECT name.o            { ?s1 name ?o1 .
                        FROM name, phone         ?s1 phone ?o3 . }
                        WHERE                }
                          name.s = phone.s
Find the resources that SELECT dept.s        SELECT ?s1
work for the Research   FROM dept            WHERE { {
department but not      WHERE                    ?s1 dept ?o1 .            ?s1
the Admin department      dept.o = Research FILTER                         B2
                        EXCEPT                   ( ?o1 = Research ) }
                        SELECT dept.s        OPTIONAL {
                        FROM dept                ?s1 dept ?o2 .
                        WHERE                FILTER ( ?o2 = Admin ) }
                          dept.o = Admin    FILTER ( !bound(?o2) )
        Table 3. Illustration of SQL-to-SPARQL translation on dataset D


3.2   Query Mapping algorithms

The conceptual evaluation strategy of an SQL query ﬁrst computes the cross-
product of the tables in the FROM list, then deletes the rows in the cross-product
that fail the qualiﬁcation conditions and ﬁnally deletes all the columns that do
not appear in the SELECT list. Additionally, if DISTINCT is speciﬁed, duplicate
rows are eliminated. Algorithm 2 is used to construct a SPARQL query and is
based on the conceptual query evaluation strategy of SQL. It takes as input an
SQL query string and a map P generated from Algorithm 1, and returns an
equivalent SPARQL query string. The SQL query string is ﬁrst used to generate
a parse tree using function parse(SQLQuery). We then extract SQL SELECT,
FROM, WHERE-JC (Join Conditions of the form: Attr Op Attr) and WHERE-
              RETRO: A Semantics Preserving SQL-to-SPARQL Translation                    9

BE (Boolean Expressions of the form: Attr Op V alue) clause strings from the
parse tree. The procedure then performs a query translation based on the type of
a query. Algorithm 2 makes use of several sub-procedures to construct a SPARQL
query. We brieﬂy explain each of these sub-procedures below. An interested
reader can ﬁnd detailed explanations of these sub-procedures in our technical
report [14].


Algorithm 2 Query-Mapping()
Input: An SQL query, qin , and a map P
Output: A SPARQL query, qout
 1: qout =“” {A SPARQL query that is initially blank}
 2: tree = parse(qin ) { A parse tree obtained by parsing qin }
     SELECT                                   F ROM
 3: qin        = tree.getSelectClause(); qin         = tree.getF romClause();
     W HERE−JC                                      W HERE−BE
 4: qin            = tree.getJoinConditions() qin               = tree.getBooleanExpr()
     SELECT
 5: qout       =“SELECT ” qout    W HERE
                                           =“W HERE { ”
 6: T P ← ∅ { The map of triple patterns is initially empty}
 7: if tree.type = CrossProductQuery then
                                            F ROM
 8:    T P = TransSQLFromClause(qin               )
        W HERE                                        W HERE−JC       W HERE−BE
 9:    qout      + = TransSQLWhereClause(qin                       , qin         , T P, P )
         SELECT                                        SELECT
10:    qout       + = TransSQLSelectClause(qin                 ,TP)
11:             SELECT
       qout = qout          W HERE
                        + qout         +“ }”
12: else if tree.type = SPJQuery then
                                            F ROM
13:    T P = TransSQLFromClause(qin               ))
         W HERE                                      W HERE−JC        W HERE−BE
14:    qout      + = TransSQLWhereClause(qin                      , qin         , T P, P )
                           W HERE
15:    T P = extractT P (qout         )
         SELECT                                        SELECT
16:    qout       + = TransSQLSelectClause(qin                 ,TP)
17:             SELECT
       qout = qout          W HERE
                        + qout         +“ }”
18: else if tree.type = UnionQuery then
19:    q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree()
20:    q1out = Query-Mapping(q1 ), q2out = Query-Mapping(q2 )
21:    qout = Merge(q1out , q2out , “UNION”)
22: else if tree.type = IntersectQuery then
23:    q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree()
24:    q1out = Query-Mapping(q1 ), q2out = Query-Mapping(q2 )
25:    qout = Merge(q1out , q2out , “INTERSECT”)
26: else if tree.type = ExceptQuery then
27:    q1 = tree.lef tSubT ree(), q2 = tree.rightSubT ree()
28:    q1out = Query-Mapping(q1 ), q2out = Query-Mapping(q2 )
29:    qout = Merge(q1out , q2out , “EXCEPT”)
30: end if
31: return qout


    TransSQLFromClause: This sub-procedure takes as input a set of table
names from a SQL FROM clause, R, and returns a map from table names to
triple patterns, T P . For each table name that is present in a FROM clause, the
sub-procedure generates a triple pattern of the form, ?si ri ?oi , where ri denotes
10      RETRO: A Semantics Preserving SQL-to-SPARQL Translation

a given table name. A generated triple pattern is then added to map T P indexed
by its corresponding table name.
    TransSQLWhereClause: This sub-procedure takes sets of join conditions
JC and boolean expressions BE, a map T P from TransSQLFromClause
and map P generated from Algorithm 1 as inputs and returns a SPARQL
WHERE clause string. In this sub-procedure, we use the aforementioned rank-
ing of relations to sort the set JC to ensure that a consistent join ordering
is obtained. We present the following example to show the necessity of rank-
ing relations: Consider relations name, age and dept with ranks 1, 2 and 3
and a query containing only join conditions “name.s = age.s AND dept.s =
age.s”. If we do not rank relations, we would get the SPARQL WHERE clause:
{?s1 name ?o1 . ?s3 age ?o2 . ?s3 dept ?o3 }. However, using a sorting procedure,
the second join condition would become age.s = dept.s, since the rank of dept (3)
was originally > the rank of age (2). The sorted join conditions now lead to the
correct WHERE clause: {?s1 name ?o1 . ?s1 age ?o2 . ?s1 dept ?o3 }. Thus, the
ranking of relations used in join re-ordering ensures generation of a consistent
SPARQL WHERE clause. Note that the process of ranking relations and sort-
ing join conditions is merely an implementation detail and has little conceptual
signiﬁcance. The sub-procedure then uses the sorted join conditions to generate
a part of the SPARQL WHERE clause string. This is followed by generating
ﬁlter conditions using triple patterns in T P and boolean expressions present in
a SQL query.
    TransSQLSelectClause: This sub-procedure takes as input a set of SQL
SELECT attributes, A, and map T P , and returns a SPARQL SELECT clause
string. It generates the variables for SPARQL SELECT by iterating over the
set A, where an attribute ai ∈ A is of the form “relationName.attributeName”.
During an iteration, the relation name part of an attribute ai is used to access
the corresponding triple pattern from T P and the attribute name part of ai is
used to access the desired variable within this triple pattern.
    Merge: This sub-procedure is used to merge two generated SPARQL sub-
query strings for SQL queries containing Union, Intersection or Diﬀerence op-
erators. The sub-procedure ﬁrst extracts the necessary components of both
SPARQL sub-queries (SELECT clause, triple patterns and ﬁlter expressions in
WHERE clause). Next, the two sub-queries are combined into a merged query
using various operations depending on the operator in the original SQL query.
These operations include concatenation of triple patterns of both sub-queries,
variable renaming to ensure consistency and introduction of speciﬁc keywords.


4    Semantics Preserving SQL-To-SPARQL Translation

An important principle of compositional semantics is that the semantics should
be compositional i.e. the meaning of a program expression should be built out
of the meanings of its sub-expressions. We present compositional semantics in
terms of abstract syntax deﬁnition of the language, semantic algebras and the
valuation functions.
             RETRO: A Semantics Preserving SQL-to-SPARQL Translation           11

 – Abstract Syntax: It is commonly speciﬁed as BNF grammar and it maps
   expressions in the language to parse trees. A Syntax domain is the term
   used to denote a collection of values with common syntactic structure.
 – Semantic Algebra: A Semantic algebra consists of a semantic domain accom-
   panied by a corresponding set of operations. A semantic domain is a set of
   elements grouped together because they share some common property, for
   e.g., set of natural numbers, set of tuples in a relation etc. The operations
   are functions that need arguments from the domain to produce answers,
   for e.g., selection, projection, join, are meaningful operations over the do-
   main of relations. An Operation is deﬁned in terms of its functionality and
   a description of its mapping.
     • Functionality: For a function f , its functionality is an expression from
        domain and co-domain denoted by f : D1 × D2 × ... × Dn → A, which
        implies f needs an argument from domain D1 and one argument from
        D2 ,..., and one from Dn to produce an answer belonging to domain A.
     • Description: A description of an operation’s mapping is generally given
        as an equational deﬁnition. However, a set, a graph, a table or a diagram
        can also be used. In this paper we use equational deﬁnitions.
 – Valuation Function: A Valuation function is a collection of functions, one for
   each syntax domain. They map elements of syntax domain (Abstract syntax)
   to elements of semantic domain (Semantic algebra), for e.g., a valuation
   function can be used to map a query string (syntax domain) to its result
   relation (semantic domain).

4.1   Compositional Semantics of SPARQL
In this section, we ﬁrst describe the related formal notation and then give ab-
stract syntax, semantic algebra of mapping sets and valuation functions as the
compositional semantics for SPARQL.
 – RDF terms: Denoted by T , it comprises of pairwise disjoint inﬁnite sets of
   I, B and L (IRI’s, Blank nodes and Literals respectively).
 – Triple: A triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where s is a subject, p
   is a predicate, and o is an object.
 – Triple Pattern: A triple pattern, tp, is a triple of the form (sp, pp, op) ∈
   (I ∪ V ∪ L) × (I ∪ V ) × (I ∪ V ∪ L), where V is an inﬁnite set of variables
   that is pairwise disjoint from the sets I, B, and L; and sp, pp, and op are a
   subject pattern, predicate pattern, and object pattern respectively.
 – A mapping µ is a partial function µ : V → T . Given a triple pattern t,
   µ(t) denotes the triple obtained by replacing the variables in t according to
   µ. Domain of µ, dom(µ) is the subset of V where µ is deﬁned and Ω is a
   set of mappings µ. Two mappings, µ1 and µ2 are compatible when for all
   x ∈ dom(µ1 ) ∩ dom(µ2 ), it is the case that µ1 (x) = µ2 (x).
Abstract Syntax for SPARQL
The abstract syntax shown below describes the structure of graph pattern ex-
pressions.
12       RETRO: A Semantics Preserving SQL-to-SPARQL Translation

gp ∈ graphpatterns, where gp represents the syntax domain of graph patterns
gp ::= tp | gp F ILT ER expr | P ROJECTv1 ,v2 ,..,vn gp | gp U N ION gp | gp AN D gp | gp OP T gp

Semantic Algebra for the domain of mapping sets
Semantic Domain: Mapping Set, Ω.
Operations deﬁned over the semantic domain Ω are:
 1. The σexpr operator takes a mapping set as an input and retains only the
    mappings that satisfy the expression expr and returns the resulting mapping
    set. Its functionality is denoted by σexpr : Ω → Ω and description by:
      – σ∏expr Ω = {µ|µ ∈ Ω ∧ µ  expr}
 2. The vars operator takes an input mapping set and retains only the variables
    belonging to the set vars in each ∏     mapping and returns the resulting mapping
         Its functionality is given by v1 ,v2 ,..,vn : Ω → Ω and description by:
    set. ∏
      – v1 ,v2 ,..,vn Ω = {µ|v1 ,v2 ,..,vn |µ ∈ Ω}
 3. The Union operator ∪, takes as input two mapping sets and computes the
    set union and returns the resulting mapping set. Its functionality is given by
    ∪ : Ω × Ω → Ω and description by:
      – Ω1 ∪ Ω2 = {µ|µ ∈ Ω1 ∨ µ ∈ Ω2 }
 4. The Join operator’s functionality is given by o     n: Ω × Ω → Ω, description by:
      – Ω1 o n Ω2 = {µ1 ∪ µ2 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings}
 5. Diﬀerence operator’s functionality is r : Ω × Ω → Ω, description is:
      – Ω1 r Ω2 = {µ ∈ Ω1 | for all µ0 ∈ Ω2 , µ and µ0 are not compatible }
 6. The Optional operator’s functionality is          : Ω × Ω → Ω, description is:
      – Ω1 Ω2 = {(Ω1 o    n Ω2 ) ∪ (Ω1 r Ω2 )}
Valuation functions
   []: gp → Ω ; The valuation function [], maps an element of gp to Ω, i.e, it
maps the language’s abstract syntax structures to meanings drawn from semantic
domains. The domain of a valuation function is a set of derivation trees of the
language and is deﬁned structurally. It determines meanings of its subtrees and
combines them into meaning of the entire tree.
  – [tp ]= {µ|dom(µ) = var(tp) ∧ µ(tp) ∈ G}
  – [gp F ILT ER expr]= σexpr∏[gp ]
  – [P ROJECTv1 ,v2 ,..,vn gp]= v1 ,v2 ,..,vn [gp ]
  – [gp1 U N ION gp2 ]= [gp1 ]∪ [gp2 ]
  – [gp1 AN D gp2 ]= [gp1 ]o n [gp2 ]
  – [gp1 OP T gp2 ]= [gp1 ] [gp2 ]
The meaning of expression [gp1 U N ION gp2 ]is computed by ﬁrst evaluating
[gp1 ]to get Ω1 and [gp2 ]to get Ω2 and then computing Ω1 ∪Ω2 using the mapping
set’s semantic algebra operation union, deﬁned earlier.
Given a mapping µ and expression expr, µ satisﬁes expr (µ  expr), iﬀ
  (i) expr is bound(?X) and ?X ∈ dom(µ);
 (ii) expr is ?X op c, ?X ∈ dom(µ) and µ(?X) op c;
(iii) expr is ?X op ?Y , ?X, ?Y ∈ dom(µ), and µ(?X) op µ(?Y ),
      where op ∈ {< | ≤ | ≥ |>| =};
(iv) expr is (¬expr1 ) and it is not the case that µ  expr1 ;
 (v) expr is (expr1 ∧ expr2 ) and µ  expr1 and µ  expr2 ;
(vi) expr is (expr1 ∨ expr2 ) and µ  expr1 or µ  expr2 ;
             RETRO: A Semantics Preserving SQL-to-SPARQL Translation             13

4.2   Compositional Semantics of Relational Algebra
The close relationship between SQL and Relational Algebra is the basis for query
optimization in a RDBMS [15], and in our work it forms the basis for query trans-
lation. In order to simplify the proof, we use relational algebra based syntax and
deﬁne its compositional semantics by using tuple relational calculus formulas
since they closely mirror the mapping based semantics of SPARQL. The proof
of equivalence between relational algebra and relational calculus presented in [7]
enables us to deﬁne the compositional semantics in this manner.

Abstract Syntax for Relational Algebra The abstract syntax describes the
structure of relational algebra (RA) expressions.
E ∈ Expression, ∏  where E represents the syntax domain of RA expressions.
    E → σCond E| A1 ,A2 ,..,An E|E1 ∪ E2 |E1 o
                                             nCond E2 |E1 –E2 |E1 ∩ E2 |E1 × E2

Semantic Algebra for the domain of relations
Semantic Domain: Relation, R ; A relation is a set of all tuples such that they
are all deﬁned over the same schema or set of attributes.
Operations deﬁned over the semantic domain R are:
1. σCond operator takes a relation as an input and retains only the tuples that
   satisfy the expression Cond and returns the resulting relation. Cond is a
   boolean expression deﬁned in the abstract syntax of SQL. Its functionality
   is denoted by σCond : R → R and description by:
     – σCond R = {t|R(t) ∧ Cond(t)} where R(t) denotes tuple t belongs to
       relation R.                              ∏
2. The∏ Project operator’s functionality is A1 ,A2 ,..,An : R → R, description is:
     – A1 ,A2 ,..,An R = {t.A1 , t.A2 , .., t.An |R(t)}
3. Union operator’s functionality is ∪ : R × R → R and description is:
     – R1 ∪ R2 = {t|(R1 (t) ∨ R2 (t)) ∧ (ξ(R1 ) ≡ ξ(R2 ))}
       A Union operation can be performed only over union compatible re-
       lations i.e. their schemas should be identical and it is denoted by the
       condition (ξ(R1 ) ≡ ξ(R2 )) where ξ(R) stands for schema of a relation
       R.
4. The Join operator’s functionality is o    nCond : R1 × R2 → R and description is:
     – R1 o nCond R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 =
       t1 .A2 ∧ ... ∧ t.An = t1 .An ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 ∧ ... ∧ t.Bn =
       t2 .Bn ∧ Cond(t))}
5. Diﬀerence operator’s functionality is – : R × R → R and description is:
     – R1 –R2 = {t|R1 (t) ∧ ∀t2 (R2 (t2 ) → t 6= t2 )}
6. Intersection operator’s functionality is ∩ : R × R → R and description is:
     – R1 ∩ R2 = {t|(R1 (t) ∧ R2 (t)) ∧ (ξ(R1 ) ≡ ξ(R2 ))}
7. Cross product operator’s functionality is × : R × R → R and description is:
     – R1 × R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 = t1 .A2 ∧
       ... ∧ t.An = t1 .An ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 ∧ ... ∧ t.Bn = t2 .Bn )}
       where A1 , A2 , .., An are attributes of relation R1 and B1 , B2 , .., Bn are
       attributes of R2 .
14      RETRO: A Semantics Preserving SQL-to-SPARQL Translation

Valuation functions
   []r : E → R ; The valuation function []r , maps an element of syntax domain
E to an element of semantic domain R.
 – [σ
    ∏Cond E ]r = σCond [∏  E ]r
 – [ A1 ,A2 ,..,An E ]r = A1 ,A2 ,..,An [E ]r
 – [E1 ∪ E2 ]r = [E1 ]r ∪ [E2 ]r
 – [E 1 o
        nCond E2 ]r = [E1 ]r o  nCond [E2 ]r
 – [E1 –E2 ]r = [E1 ]r – [E2 ]r
 – [E1 ∩ E2 ]r = [E1 ]r ∩ [E2 ]r
 – [E 1 × E 2 ]r = [E 1 ]r × [E 2 ]r
Theorem 1. Given an RDF dataset D, an SQL query, sql ∈ SqlQuery, with
an equivalent relational algebra expression RA ∈ E over a relational schema S
derived from D, the translation of sql to an equivalent SPARQL query, sparql(or
an equivalent sparql algebra expression SA), over dataset D, is a semantics pre-
serving translation.
Proof. A query is made up of operators and operands and two given queries are
equivalent if they are made up of semantically equivalent operators applied on
equivalent data in the same sequence. Having deﬁned the compositional seman-
tics for Relational Algebra and SPARQL, proving the semantic equivalence boils
down to equating their respective semantic algebras.
 – Selection If Ω ≡ [tp ], Cond ≡ expr and R ∈ S such that tp = ?s R ?o
   then RA ≡ SA.
   RA: σCond R = {t|R(t) ∧ Cond(t)}
   SA: F ILT ERexpr Ω = {µ|µ ∈ Ω ∧ expr(µ)}
   The conditions Ω ≡ [tp ]and R ∈ S such that tp = ?s R ?o ensure the data
   equivalence and RA ≡ SA stand for operator equivalence.
 – Projection If Ω ≡ [tp ], {∀i, Ai ≡ vi , 1 ≤ i ≤ n} and R ∈ S such that tp =
   ?s R ∏?o then RA ≡ SA.
   RA: A1 ,A2 ,..,An R = {t.A1 , t.A2 , .., t.An |R(t)}
   SA: P ROJECTv1 ,v2 ,..,vn Ω = {µ|v1 ,v2 ,..,vn |µ ∈ Ω}
 – Union If Ω1 ≡ [gp1 ], Ω2 ≡ [gp2 ], gp1 = tp1 AN D tp2 , gp2 = tp1 AN D tp3 ,
                 ∏3 ∈ S such that tp1 =?s1 R1 ?o1 , tp2∏=?s1 R2 ?o2 , tp3 =?s1 R3 ?o3
   and R1 , R2 , R
   and Rx =        R1 .s,R1 .o (R1 o
                                   n R2 ) and Ry =        R1 .s,R1 .o (R1 o
                                                                          n R3 ), then
   RA ≡ SA.
   RA: ∏ Rx ∪ Ry = {t|(Rx (t) ∨ Ry (t)) ∧ (ξ(Rx ) ≡ ξ(Ry ))}
   SA: ?s1 ,?o1 (Ω1 ∪ Ω2 ) = {µ|s1,o1 |µ ∈ Ω1 ∨ µ ∈ Ω2 }
 – Join If Ω1 ≡ [tp1 ],Ω2 ≡ [tp2 ], Cond ≡ expr, and R1 , R2 ∈ S such that
   tp1 =?s1 R1 ?o1 , tp2 =?s1 R2 ?o2 then RA ≡ SA3.
   RA: R1 o  nCond R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 =
   t1 .A2 ∧ t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 ∧ Cond(t))}
   SA1: Ω1 o  n Ω2 = {µ1 ∪ µ2 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings}
   SA2: F ILT ERexpr Ω = {µ|µ ∈ Ω ∧ expr(µ)}
   SA3: F ILT ERexpr (Ω1 o      n Ω2 )
   Semantics of SA3 is composed of semantics of SA1 and SA2.
             RETRO: A Semantics Preserving SQL-to-SPARQL Translation               15

 – Diﬀerence If Ω1 ≡ [gp1 ], Ω2 ≡ [gp2 ], R ∈ S such that tp1 =?s1 R?o1 ,
   tp2 =?s∏1 R?o2 , gp1 = F ILT ER?o1 Op “V alue1
                                               ∏ 00 (tp1 ), gp2 = F ILT ER?o1 Op “V alue200 (tp2 ),
   R1 = R.s (σR.o Op “V alue100 (R)), R2 = R.s (σR.o Op “V alue200 (R)), expr ≡
   ¬bound(?o2 ) , then RA ≡ SA3.
   RA: R1 –R2 = {t|R1 (t) ∧ ∀t2 (R2 (t2 ) ∧ t 6= t2 ) ∧ (ξ(R1 ) ≡ ξ(R2 ))}
   SA1: Ω1 Ω2 = {Ω1 o       n Ω2 ) ∪ (Ω1 r Ω2 )}
   SA2: F ∏ ILT   ER expr Ω = {µ|µ ∈ Ω ∧ expr(µ)}
   SA3: ?s1 (F ILT ERexpr (Ω1 Ω2 ))
   Semantics of SA3 is composed of semantics of SA1 and SA2.
 – Intersection If Ω1 ≡ [gp1 ], Ω2 ≡ [gp2 ], gp1 = tp1 AN D tp2 , gp2 =
   tp1 AN D tp3 , and R1 , R2 , ∏R3 ∈ S such that tp1 =?s1 R1 ?o1 ,∏tp2 =?s1 R2 ?o2 ,
   tp3 =?s1 R3 ?o3 and Rx = R1 .s,R1 .o (R1 o   n R2 ) and Ry = R1 .s,R1 .o (R1 o  n
   R3 ), then RA ≡ SA.
   RA: ∏ Rx ∩ Ry = {t|(Rx (t) ∧ Ry (t)) ∧ (ξ(Rx ) ≡ ξ(Ry ))}
   SA: ?s1 ,?o1 (Ω1 o  n Ω2 ) = {(µ1 ∪ µ2 )|s1,o1 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible
   mappings}
 – Cross product If Ω1 ≡ [tp1 ],Ω2 ≡ [tp2 ]and R1 , R2 ∈ S such that tp1 =
   ?s1 R1 ?o1 , tp2 =?s2 R2 ?o2 then RA ≡ SA.
   RA: R1 × R2 = {t|∀t1 , ∀t2 (R1 (t1 ) ∧ R2 (t2 ) → t.A1 = t1 .A1 ∧ t.A2 = t1 .A2 ∧
   t.B1 = t2 .B1 ∧ t.B2 = t2 .B2 )}
   SA: Ω1 o n Ω2 = {µ1 ∪ µ2 |µ1 ∈ Ω1 , µ2 ∈ Ω2 are compatible mappings}

Since all the operators in relational algebra are proved equivalent to correspond-
ing operators in SPARQL, it can be concluded that ∀sql ∈ SqlQuery, where RA
is relational algebra expression of sql, [RA]r ≡ [sparql ]. Further explanation
of the proof can be found in [14].


5   Related work

There have been several attempts to make RDBMS and RDF stores inter-
operate. The most popular one, D2RQ [3], has explored one direction i.e they
construct an RDF schema from Relational data. A similar eﬀort, W3C RDB2RDF
WG [9], aims to generate RDF triples from one or more Relational tables without
loss of information. In [8], the author describes a transformation from SPARQL
to relational algebra and attempts to make existing work on query planning and
optimization available to SPARQL implementors. RETRO provides the transla-
tion in the reverse direction and proves the translation to be semantics preserving
by reusing the work done on compositional semantics of SPARQL in [12] and [5].
Compositional semantics of SPARQL discussed in [12] is reused and extended
by authors in [5] by providing additional operators such as F ILT ERexpr and
SELECT (v1 , v2 ..vn ). However, unlike [12] and [5], we explicitly deﬁne compo-
sitional semantics in terms of valuation functions, which map the elements of
syntax domain to elements of the semantic domain and use it towards the proof
of semantics preservation of the query translation.
16      RETRO: A Semantics Preserving SQL-to-SPARQL Translation

6    Conclusion and Future Work
In this paper, we have provided interoperability between RDF Stores and RDBMS
by ﬁrstly deriving a relational schema from the RDF store, and secondly provid-
ing a means to translate an SQL query to a semantically equivalent SPARQL
query. In future versions of RETRO, we plan to provide an algorithm for deriving
a user friendly relational schema, and extend the query mapping algorithm with
additional and advanced SQL operators such as aggregation. Another promising
direction for future work is to leverage the query optimization available for SQL
queries to generate eﬃcient, equivalent SPARQL queries.

References
 1. Abadi, D.J., 0002, A.M., Madden, S., Hollenbach, K.J.: Scalable Semantic Web
    Data Management Using Vertical Partitioning. In: VLDB. pp. 411–422 (2007)
 2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.
    (eds.): The Description Logic Handbook: Theory, Implementation, and Applica-
    tions. Cambridge University Press (2003)
 3. Bizer, C., Cyganiak, R.: D2RQ Lessons Learned. http://www.w3.org/2007/03/
    RdfRDB/papers/d2rq-positionpaper/
 4. Calvanese, D., Lenzerini, M., Nardi, D.: Unifying Class-Based Representation For-
    malisms. J. Artif. Intell. Res. (JAIR) 11, 199–240 (1999)
 5. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL transla-
    tion. Data Knowl. Eng. 68(10), 973–1000 (2009)
 6. Codd, E.F.: Relational Completeness of Data Base Sublanguages. In: R. Rustin
    (ed.): Database Systems: 65-98, Prentice Hall and IBM Research Report RJ 987,
    San Jose, California (1972)
 7. Codd, E.F.: Relational Completeness of Data Base Sublanguages. In: R. Rustin
    (ed.): Database Systems: 65-98, Prentice Hall and IBM Research Report RJ 987,
    San Jose, California (1972)
 8. Cyganiak, R.: A relational algebra for SPARQL (2005), http://www.hpl.hp.com/
    techreports/2005/HPL-2005-170.html
 9. Malhotra, A.: W3C RDB2RDF Incubator Group Report. http://www.w3.org/
    2005/Incubator/rdb2rdf/XGR-rdb2rdf-20090126/ (2009)
10. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational
    databases. J. Web Sem. 7(2), 74–89 (2009)
11. Parker, D.S.: Stream data analysis in Prolog, pp. 249–312. MIT Press, Cambridge,
    MA, USA (1990), http://dl.acm.org/citation.cfm?id=103477.103548
12. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of sparql. In: In-
    ternational Semantic Web Conference. pp. 30–43 (2006)
13. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Rec-
    ommendation (2008)
14. Rachapalli, J., Khadilkar, V., Kantarcioglu, M., Thuraisingham, B.:
    RETRO: A Framework for Semantics Preserving SQL-to-SPARQL Transla-
    tion.       http://www.utdallas.edu/~vvk072000/Research/Technical-Report/
    technical-report.html (2011)
15. Ramakrishnan, R., Gehrke, J., Ramakrishnan, R., Gehrke, J.: Database Man-
    agement Systems. McGraw-Hill Science/Engineering/Math, 3 edn. (Aug 2002),
    http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20\&path=
    ASIN/0072465638

</pre>