  Optimizing SPARQL queries over the Web of Linked Data

                          B. R. Kuldeep Reddy                                               P. Sreenivasa Kumar
                  Indian Institute of Technology Madras                              Indian Institute of Technology Madras
                               Chennai, India                                                     Chennai, India
                        brkreddy@cse.iitm.ac.in                                               psk@cse.iitm.ac.in



To copy without fee all or part of this material is permitted only for private and academic purposes, given that the title of the publication, the authors and its date of publication appear. Copying or use for commercial purposes, or to republish, to post on servers or to redistribute to lists, is forbidden unless explicit permission is acquired from the copyright owners; the authors of the material.
Workshop on Semantic Data Management (SemData@VLDB) 2010, September 17, 2010, Singapore.
Copyright 2010: www.semdata.org.

ABSTRACT

The web of linked data represents a globally distributed dataspace. It can be queried with SPARQL, whose execution proceeds by asynchronously traversing RDF links to discover data sources at run-time. However, the optimization of SPARQL queries over the web of data remains a challenge, and in this paper we present an approach addressing this problem. The proposed approach works in two phases to optimize SPARQL queries. The first phase analyzes the query before its execution and discovers classes of data that do not contribute towards answering it, which can then be prevented from being fetched. However, this analysis may miss a number of patterns that can only be discovered at run-time. The second phase analyzes the query execution to discover more patterns, which are used to further reduce the amount of data fetched from the web to answer the query. The main idea of this phase is to model the query execution as a context graph that is used by a heuristic to discover the patterns, which are passed sideways to prune nodes in the context graph, resulting in an improvement in query performance. The implementation of our approach demonstrates its benefits.

1.  INTRODUCTION

Semantic data management includes the various techniques that use data based on its meaning. There has been a recent growth in publishing data on the web following the linked data principles, forming a global web of linked data. Efficient querying of this web of linked data is thus an important component of semantic data management and is addressed in this paper.

The traditional World Wide Web has allowed the sharing of documents among users on a global scale. The documents are generally represented in HTML or XML formats and are accessed using URLs and the HTTP protocol, creating a global information space. However, in recent years the web has evolved towards a web of data [3], as the conventional web's data representation sacrifices much of its structure and semantics [1], and the links between documents are not expressive enough to establish the relationships between them. This has led to the emergence of the global data space known as Linked Data [1]. Linked data basically interconnects pieces of data from different sources utilizing the existing web infrastructure. The data published is machine readable, which means it is explicitly defined. Instead of using HTML, linked data uses the RDF format to represent data. The connection between data is made by typed statements in RDF, which clearly define the relationship between them, resulting in a web of data. The Linked Data Principles outlined by Berners-Lee for publishing data on the web basically suggest using URIs as names for things, which are described in RDF format and accessed using the HTTP protocol.

The RDF model describes data in the form of subject, predicate and object triples. The subject and object of a triple can both be URIs that each identify an entity, or a URI and a string value respectively. The predicate denotes the relationship between the subject and object, and is also represented by a URI. SPARQL is the query language proposed by the W3C recommendation to query RDF data [8]. A SPARQL query basically consists of a set of triple patterns. It can have variables in the subject, object or predicate positions of each triple pattern. The solution consists of binding these variables to entities which are related to each other in the RDF model according to the query structure. An execution approach for evaluating SPARQL queries on linked data is proposed in [6]. The idea presented in [5] maintains index structures that are used to select relevant sources for query answering, as in federated query processing. However, an index covering a large amount of data on the Web requires far more resources. According to [6], a SPARQL query is basically executed by iteratively dereferencing URIs to fetch their RDF descriptions from the web and building solutions from the retrieved data. It must have a seed URI as the subject of the first query pattern. This is explained with an example below.

Figure 1: Example SPARQL query
Example. The SPARQL query shown in Figure 1 searches for friends of Tim-Lee who work for an entity, and the number of students in that entity. The query execution begins by fetching the RDF description of Tim-Lee by dereferencing his URI. The fetched RDF description is then parsed to gather a list of his friends. Parsing is done by looking for triples that match the first pattern in the query. The object URIs in the matched triples form the list of Tim-Lee's friends. Let us say John, Peter and Mary were found to be his friends. The query execution proceeds by fetching the RDF descriptions corresponding to John, Peter and Mary. Say Mary's graph is retrieved first. It is parsed to check for triples matching the second query pattern, and it is found that Mary works for a university. The university's details are again fetched, and the third triple pattern in the query is searched for in the graph to get the number of students. Peter's and John's graphs and their companies' details would also be retrieved, and the query execution would proceed in a way similar to Mary's. But since Peter and John work for corporate entities which do not have students, the third triple pattern is not answered for them.
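The mechanics of this traversal can be sketched in a few lines of Java. The sketch below uses Apache Jena for dereferencing and parsing, which is our choice for illustration only; the seed URI and the hasFriend/worksFor property URIs are placeholders, not the actual vocabulary of Figure 1.

```java
import org.apache.jena.rdf.model.*;

// Minimal sketch of link-traversal execution for the first two query patterns:
// dereference the seed URI, collect the ?friend bindings, then dereference each friend.
// All URIs below are illustrative placeholders.
public class LinkTraversalSketch {
    static final String SEED = "http://example.org/people/Tim-Lee";        // assumed seed URI
    static final String HAS_FRIEND = "http://example.org/vocab/hasFriend"; // assumed predicate
    static final String WORKS_FOR = "http://example.org/vocab/worksFor";   // assumed predicate

    public static void main(String[] args) {
        Model seedGraph = ModelFactory.createDefaultModel();
        seedGraph.read(SEED);                                    // dereference the seed URI
        Property hasFriend = seedGraph.createProperty(HAS_FRIEND);

        // Triples matching the first pattern yield the list of friends.
        StmtIterator it = seedGraph.listStatements(
                seedGraph.createResource(SEED), hasFriend, (RDFNode) null);
        while (it.hasNext()) {
            RDFNode friend = it.next().getObject();
            if (!friend.isURIResource()) continue;
            Model friendGraph = ModelFactory.createDefaultModel();
            friendGraph.read(friend.asResource().getURI());      // dereference each friend's URI
            // ...match the worksFor pattern in friendGraph, dereference the employer, and so on.
        }
    }
}
```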
The SPARQL query shown in Figure 1 can be optimized as follows. Let us say Tim-Lee has friends who are either professors or managers. Suppose Mary is a professor while Peter and John are managers. A professor works for a university and managers work for corporate entities. The third triple pattern in the query can only be answered if the entity is a university, as only a university has students. Fetching the details of the managers and the companies they work for does not result in the third triple pattern being answered, and therefore they need not be fetched. The decision on whether to continue the execution for a friend can be taken immediately when his RDF description is fetched, thus avoiding retrieving his company information in case he is a manager. In this example, when John's, Peter's and Mary's data is fetched it is determined that only Mary is a professor, and the execution proceeds only for her, resulting in performance benefits.

The proposed approach works in two phases. The first phase analyzes the query with the help of the vocabularies (RDFS or OWL) used to describe the resources involved. This phase traverses the query graph, annotating nodes with the classes of data they can have. It determines the classes of data that do not produce results for the query and thus helps in its optimization. However, it is limited in the type of patterns it can detect and hence calls for an additional run-time phase. The second phase models the query execution as a context graph and employs a run-time heuristic that discovers patterns. Many adaptive techniques have been proposed [2] that operate at run-time. Our method is similar to the sideways information passing technique [7], in which the discovered patterns are passed along sideways during query execution. These patterns are then used to avoid the retrieval of data by pruning nodes in the context graph, similar to the branch and bound technique. The heuristic is effective in optimizing the query, but it comes at the cost of completeness of results. Some patterns are precise, such as "a teaching assistant of a course will not credit it", and do not affect the completeness of results. Others, such as "a faculty member who publishes a paper with a student is his advisor", are approximations, and their use will not give complete results.

2.  QUERY ANALYSIS PHASE

The query graph is divided into different levels according to its structure. A level is associated with each triple pattern in the query. The level of a node in the query graph can also be viewed as its distance from the root in terms of the number of predicates. Before the query execution can begin, we analyze it in this phase with the help of the vocabularies used to describe the resources (we assume a mapping exists between the vocabularies involved; publishing such links is considered good practice [1]). The idea is to identify classes of data which do not contribute during the query execution, in that they do not produce results. The data corresponding to those classes is deemed redundant and need not be fetched from the web. The vocabulary contains the subclass hierarchy details. This information can be exploited to restrict the query execution to only those classes which generate results.

Figure 2: Example Query Graph

Example. Let us consider the query of Figure 1 represented as a graph in Figure 2. The first triple pattern's predicate is hasFriend, whose vocabulary description reveals that its range is the class of people, who can be either managers or professors. The second triple pattern's predicate is worksFor, whose domain is again the class of people and whose range is the class of entities found from the vocabulary. The third triple pattern's predicate is numberOfStudents, whose domain is the class of universities. We assume that there exists a mapping which is able to map elements such as the professor and university classes and the worksFor and employs predicates between the two vocabularies in the query. The third node is therefore restricted to only the class of universities. The vocabulary also reveals that universities employ professors. This information is then used to restrict the second node to only the class of professors. Thus, using the vocabularies of the resources involved, we are able to detect the pattern that managers do not work for universities. Using this pattern reduces the amount of data fetched from the web, optimizing the query.
Algorithm. The query is represented as a directed graph, and the algorithm basically works by traversing it. It begins by dereferencing the URIs of all the predicates in the query. The descriptions of the predicates in the vocabulary give their domains and ranges. The links corresponding to the domains and ranges are also resolved and their details fetched to discover further information regarding their subclasses. This is taken care of in lines 1-4 of Algorithm 1. The query analysis begins at the root's (seed URI) predicate, predicate_0, and proceeds by taking one outgoing path, predicate_i, at a time in a breadth-first manner in lines 7-16. Once a path is taken, its object node is annotated with the classes of its predicate's range. In lines 11-13 all the domain classes of the paths from this object node are noted and their intersection taken. If the intersection is a subclass of the current class, it replaces the current class of the object node in lines 14-15. Now that the node has been restricted, its ancestors are checked in line 16 to see whether they can also be restricted with this new information. For example, suppose the node has been constrained to the class of universities and its parent node has the classes of managers and professors. If there is a schema element which says that a university employs professors, then we can restrict the parent to the class of professors. This procedure is repeated till all the predicates are traversed. The final classes annotated on the nodes indicate that the query execution should be restricted only to them, resulting in less data being fetched from the web.
Algorithm 1: Query Graph Analysis
  Input:  Query graph Q
  Output: Updated node annotations of Q with classes indicating that the query execution be restricted to them
   1  foreach predicate P of Q do
   2      resolve P to fetch its description R
   3      parse R to find out P_domain and P_range
   4      further fetch and parse P_domain and P_range for their subclass hierarchies
   5  list ← null
   6  append predicate_0 to list
   7  while list is not empty do
   8      remove the head of list and assign it to p
   9      node ← p_object ; node_classes ← p_range
  10      domain ← universalSet
  11      foreach outgoing predicate_i from node do
  12          domain ← domain ∩ domain(predicate_i)
  13          append predicate_i to list
  14      if domain ⊂ p_range then
  15          node_classes ← domain
  16          restrict node_ancestors

Though this phase serves to identify classes of data which do not produce results, it is limited in the patterns it can detect. Suppose we are given a query to find the people in a department who take a graduate course. Following the above algorithm, the range of the first triple pattern's predicate is the class of people, which comprises faculty and students. The domain of the second triple pattern's predicate, takesCourse, is the class of students. Therefore the node connecting both triples is constrained to the class of students. But the third triple pattern restricts the type of the courses to graduate courses. Since we do not have information in the vocabulary which says that graduate courses are generally only taken by students who have advisors, we are unable to further constrain the node identifying the people from students to only students who have advisors. This calls for a run-time approach which analyzes the query execution to discover patterns that are missed by the query analysis phase.

3.  QUERY EXECUTION PHASE

3.1  Data-Structures

3.1.1  Context Graph

This phase models the query execution as a context graph G. The context graph is built by adding to it nodes containing the RDF descriptions of the entities fetched from the web. A node is added as a child of the node containing the RDF description that led to it being fetched. The edge between the two nodes in G is labeled with the predicate name denoting the relationship between the two entities in them. The root is the seed which begins the execution of the query. The results occur at the leaves whose path to the root matches the corresponding sequence of predicate terms in the query. The context graph is also divided into different levels, similar to the query graph. The context graph contains bindings of the nodes in the query graph. If the subject binding of a triple pattern in the context graph belongs to level l, the object binding belongs to level l + 1. The contents of a node of the context graph are the description of the entity in the shape of a star graph. The center represents the entity being described and the edges denote its properties and their values. The context graph also acts as a cache. This allows us to check whether a resource's details have already been fetched before an attempt is made to retrieve it from the web, which further improves the query performance.
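A minimal sketch of such a context-graph node and the cache, assuming entity descriptions are held as Jena models, is given below; the class layout and field names are ours, not taken from the paper's implementation.

```java
import org.apache.jena.rdf.model.Model;
import java.util.*;

// Sketch of a context-graph node: it stores the star-shaped RDF description of one entity,
// its level, and the labeled edges to the child nodes created when that description was parsed.
class ContextNode {
    final String entityUri;                 // center of the star graph
    final Model description;                // RDF description fetched by dereferencing entityUri
    final int level;                        // distance from the seed, mirrors the query-graph level
    final Map<String, List<ContextNode>> children = new HashMap<>(); // predicate name -> child nodes

    ContextNode(String entityUri, Model description, int level) {
        this.entityUri = entityUri;
        this.description = description;
        this.level = level;
    }
}

class ContextGraph {
    private ContextNode root;                                        // the seed entity
    private final Map<String, ContextNode> cache = new HashMap<>();  // URI -> node, acts as the cache

    ContextNode addNode(ContextNode parent, String predicate, String uri, Model description) {
        ContextNode node = new ContextNode(uri, description, parent == null ? 0 : parent.level + 1);
        if (parent == null) {
            root = node;
        } else {
            parent.children.computeIfAbsent(predicate, k -> new ArrayList<>()).add(node);
        }
        cache.put(uri, node);
        return node;
    }

    ContextNode getRoot() { return root; }

    ContextNode lookup(String uri) {         // cache check before dereferencing a URI again
        return cache.get(uri);
    }
}
```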
Figure 3: Context Graph

Example. An example of a context graph is shown in Figure 3; it models the execution of the query in Figure 1. According to the query, Tim-Lee's URI is resolved to fetch his RDF description from the web, which is stored in the corresponding node of the context graph. The context graph is built by creating nodes for Peter, Mary and John
after fetching their details. It continues till the data for the last query pattern is retrieved. The result in this case is the node containing the value 10000 and its path to the root.

3.1.2  Summaries

Summaries of context nodes are maintained and used by the heuristic to decide whether or not to explore the current node of the context graph. Two summaries are associated with each node of the query graph. One summary, S^qnode_results, summarizes the information of all the result-producing context nodes associated with qnode, and the other, S^qnode_noResults, summarizes the information of all the nodes that did not lead to results for qnode. A summary is basically a star graph whose center represents a generic result-producing or non-result-producing entity and whose edges denote the properties. Each edge has a weight indicating the number of times that the triple containing that predicate and object has occurred in the nodes previously fetched. An edge with a high weight in the result-producing summary and a low weight in the non-result-producing summary can be used as a discriminant. If a node contains such an edge then it is highly probable that it will produce results, as it is similar to the nodes that produced results earlier. Summaries at any point of time contain only the top-K highest-weight triples.
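A sketch of such a summary as a weighted star graph with a top-K cap is shown below; the value of K, the eviction choice and the key encoding are assumptions of the sketch, and the update rule anticipates the insertion procedure described next.

```java
import java.util.*;

// Sketch of a per-query-node summary: a star graph whose edges are (predicate, object)
// pairs with occurrence counts, capped at the top-K entries.
class Summary {
    private final int k;
    private final Map<String, Integer> weights = new HashMap<>(); // "predicate|object" -> weight

    Summary(int k) { this.k = k; }

    // Record one triple of a newly fetched node's description.
    void record(String predicate, String object) {
        String edge = predicate + "|" + object;
        Integer w = weights.get(edge);
        if (w != null) {
            weights.put(edge, w + 1);                 // predicate and object already present
        } else if (weights.size() < k) {
            weights.put(edge, 1);                     // room left: insert with weight one
        } else {
            // summary full: replace a least-weight edge with the new one
            String victim = Collections.min(weights.entrySet(), Map.Entry.comparingByValue()).getKey();
            weights.remove(victim);
            weights.put(edge, 1);
        }
    }

    int weightOf(String predicate, String object) {
        return weights.getOrDefault(predicate + "|" + object, 0);
    }

    int weightOfPredicate(String predicate) {         // summed weight of all edges with this predicate
        int sum = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            if (e.getKey().startsWith(predicate + "|")) sum += e.getValue();
        }
        return sum;
    }
}
```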
Algorithm. When a URI is resolved to fetch an RDF description from the web, it is inserted into the context graph as a new node N. Let us say it is associated with node qn of the query graph, which has the two summaries associated with it. If the execution continues for the new node and it produces results, its details are inserted into S^qn_results; otherwise they are inserted into S^qn_noResults. The insertion procedure is as follows. Each triple in the new node N is compared with the triples in one of the summaries, chosen according to whether N produced results. If both the predicate name and the object match, the weight of the corresponding edge is incremented by one. Otherwise, if the number of triples in the summary is less than K, the triple is inserted as a new edge and its weight is set to one. On the other hand, if the number of triples is greater than K, the summary randomly selects a triple with the least weight and replaces it. This is done for all the nodes constituting the path from N to the leaf, updating the summaries at all levels with the corresponding details of the nodes.

3.1.3  Links

The query graph may have a number of nodes which have relationships amongst them other than those specified by the query. These relationships are discovered by analyzing the execution of the query. Their discovery improves the query performance by introducing an edge denoting the relationship between the query nodes, bypassing the original edges and nodes if none of the bypassed nodes need to be displayed. Therefore, the data corresponding to the bypassed nodes need not be fetched, thereby reducing the query execution time.

Example. Suppose we have a query which finds all the students who have authored a paper with a faculty member. In the query graph the nodes denoting students and faculty have a node representing publications in between. The query is executed by fetching the details of the students, followed by their publication details, followed by the publications' co-authors, to find a faculty member. During its execution we find that such a faculty member is generally an advisor of the student. Therefore, using this pattern we can check whether the student has a publication and, if he does, display the advisor without fetching the details of the publications.

Summaries by themselves are unable to discover these relationships, which calls for additional data structures. Therefore two link structures are maintained between each pair of nodes, say n1 and n2, in the given query graph. The two types of link structures between the pair of nodes in the query graph at level l are link^(n1-n2)_results and link^(n1-n2)_noResults. They are again star graphs, with the center indicating a generic result-producing or non-result-producing node and weighted predicates attached to it. They are maintained as follows. If the result for a query is produced, the predicate denoting the relationship between the node bindings of n1 and n2 is inserted into the star graph link^(n1-n2)_results and its weight is set to 1. If the predicate is already present, its weight is incremented by one. If the result for the query is not produced, link^(n1-n2)_noResults is similarly updated. This process is carried out for all pairs of nodes in the query graph. The bypassing of nodes may not be correct under certain circumstances, as it may lead to results which would not have passed through the bypassed nodes. Therefore, to deal with this situation, when a result is generated after bypassing we randomly check whether it would also have been generated through the un-bypassed original nodes. A variable is maintained which continuously computes the percentage difference in produced results between these two situations. If it falls below a certain threshold, we say that the bypassing of nodes is not correct and we drop it.
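The link structures can be sketched along the same lines, reusing the hypothetical Summary class from above; keying pairs of query nodes by name and the value of K are again assumptions of the sketch.

```java
import java.util.*;

// Sketch of the two link structures kept for a pair of query-graph nodes (n1, n2):
// one weighted star graph for executions that produced results and one for those that did not.
class LinkPair {
    final Summary results = new Summary(50);      // link_results^(n1-n2); K = 50 is illustrative
    final Summary noResults = new Summary(50);    // link_noResults^(n1-n2)

    // Called once per execution with the predicate observed between the bindings of n1 and n2.
    void update(String predicateBetweenBindings, boolean producedResult) {
        if (producedResult) {
            results.record(predicateBetweenBindings, "");   // only the predicate matters here
        } else {
            noResults.record(predicateBetweenBindings, "");
        }
    }
}

class Links {
    // One LinkPair per ordered pair of query-graph nodes, keyed by their names.
    private final Map<String, LinkPair> pairs = new HashMap<>();

    LinkPair between(String n1, String n2) {
        return pairs.computeIfAbsent(n1 + "->" + n2, k -> new LinkPair());
    }
}
```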
3.2  Heuristic

We can optimize the SPARQL query by halting the exploration of certain nodes in the context graph which do not produce results. This requires the formulation of a heuristic which makes such a decision. This is similar to the branch and bound strategy, using a heuristic to prune nodes in the context graph. A reduced number of nodes corresponds to fewer URIs being dereferenced to fetch their RDF descriptions, and thus less time taken to execute the query. The heuristic basically works by matching the contents of a node with the contents of the summaries and links associated with all the nodes of the query graph. The summaries are compared to discover edges which have a high weight in either one of them. The presence of such edges in the node is used to assess its likelihood of producing results. Links are used to discover relationships between query nodes that are not specified in the query graph.

Example. For the query in Figure 1, suppose Tim-Lee has another friend, Richard, who is a professor in some university. When his RDF description is fetched, a node is created for him at level 1 of the context graph in Figure 3. In order to decide whether to continue exploring from the current node, it is matched with the two summaries associated with the query node "?friend" at the same level. Comparison of the two summaries indicates that result-producing nodes contain the triple defining the entity to be of type Professor, and non-result-producing nodes contain the triple defining the entity to be of type Manager. Richard's description is found to contain the triple describing him as a professor, and therefore it is predicted that his node will generate results.

Algorithm. The heuristic is invoked each time a new node N is inserted into the context graph. The contents of the new node are a description of an entity in the form of a star graph. Lines 2-11 in Algorithm 2 iterate over all
the nodes in the query graph. The idea here is to find whether N contains a predicate which denotes a direct relationship between nodes in the query graph, allowing us to bypass the original nodes between them. The heuristic iterates over all the predicates in N in line 3 and counts their matches in link_results and link_noResults, storing the scores in the variables scoreLr and scoreLnR in lines 6 and 9 respectively. If the ratio of scoreLr to scoreLnR is greater than the threshold, we introduce a direct edge between the two nodes and exploration proceeds with this new edge in lines 10-11.

Lines 12-29 in Algorithm 2 iterate over all the triples in N. Each triple in N is compared with the triples in the two summaries associated with its corresponding query node n1 in the query graph. This happens in lines 14-24 of Algorithm 2. Two score variables are maintained for each summary. scorePO counts the predicate and object value matches, such as "x advisorOf studentA", whereas scoreP counts the predicate matches, such as advisorOf, between the node's and the summary's contents. Lines 17-18 and 23-24 do the comparisons and increment the scores by the weights of the respective edges. Ratios of the scores computed from the result summaries to those computed from the non-result summaries are determined. These ratios indicate the presence of edges which are predominant in S^n_results but rare in S^n_noResults. If either of the ratios is above a certain threshold in lines 25-26, we can say that N contains a pattern commonly found in the result-producing nodes of its level. Hence, it is predicted to produce results and the execution proceeds from the new node. The inverses of the ratios computed earlier are also compared with the threshold in lines 27-28; if they are greater, it means that N is very similar to the non-result-producing nodes and it may also not produce results. The parameter δ represents the probability with which a node may be explored after being rejected by the heuristic. When the node is found to be similar to the non-result-producing nodes, the probability of it being explored is diminished by a constant e in line 28.

Algorithm 2: Heuristic
  Input:  Context graph, query graph, new node N
  Output: Decision on whether to explore N
   1  let n1 be the query-graph node associated with the new node N
      foreach node n2 in the query graph do
   2      foreach predicate prd in N do
   3          scoreLr ← 0 ; scoreLnR ← 0
   4          foreach triple Tr ∈ link^(n1-n2)_results do
   5              if Tr.predicate = prd then
   6                  scoreLr ← Tr.weight
   7          foreach triple Tnr ∈ link^(n1-n2)_noResults do
   8              if Tnr.predicate = prd then
   9                  scoreLnR ← Tnr.weight
  10          if scoreLr / scoreLnR >= RATIO then
  11              explore N with the next predicate prd instead of the original path ; exit
  12  foreach triple T in N do
  13      scorePO_results ← 0 ; scoreP_results ← 0
  14      foreach triple Tr ∈ S^n1_results do
  15          if Tr.predicate = T.predicate then
  16              if Tr.object = T.object then
  17                  scorePO_results ← Tr.weight
  18              scoreP_results ← scoreP_results + Tr.weight
  19      scorePO_noResults ← 0 ; scoreP_noResults ← 0
  20      foreach triple Tnr ∈ S^n1_noResults do
  21          if Tnr.predicate = T.predicate then
  22              if Tnr.object = T.object then
  23                  scorePO_noResults ← Tnr.weight
  24              scoreP_noResults ← scoreP_noResults + Tnr.weight
  25      if scorePO_results / scorePO_noResults >= RATIO or scoreP_results / scoreP_noResults >= RATIO then
  26          explore N ; exit
  27      if scorePO_noResults / scorePO_results >= RATIO or scoreP_noResults / scoreP_results >= RATIO then
  28          with probability δ/e explore N and exit
  29  explore N with probability δ
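A compact sketch of the summary-matching decision in lines 12-29 of Algorithm 2, again reusing the hypothetical Summary class from Section 3.1.2 and folding in the exploration probability δ, is shown below; the RATIO and e values and the guard against zero denominators are our own choices.

```java
import java.util.*;

// Sketch of lines 12-29 of Algorithm 2: compare the triples of a new node against the
// result / non-result summaries of its query node and decide whether to explore it.
class Heuristic {
    static final double RATIO = 2.0;             // illustrative threshold
    static final double E = 4.0;                 // constant that diminishes the exploration probability
    private final Random random = new Random();

    boolean explore(List<String[]> nodeTriples,  // each entry: {predicate, object}
                    Summary results, Summary noResults, double delta) {
        for (String[] t : nodeTriples) {
            double poRes = results.weightOf(t[0], t[1]);
            double pRes  = results.weightOfPredicate(t[0]);
            double poNo  = noResults.weightOf(t[0], t[1]);
            double pNo   = noResults.weightOfPredicate(t[0]);

            // Lines 25-26: the triple is predominant among result-producing nodes.
            if (ratio(poRes, poNo) >= RATIO || ratio(pRes, pNo) >= RATIO) return true;

            // Lines 27-28: the triple is predominant among non-result-producing nodes;
            // explore anyway with the diminished probability delta / e.
            if (ratio(poNo, poRes) >= RATIO || ratio(pNo, pRes) >= RATIO) {
                return random.nextDouble() < delta / E;
            }
        }
        // Line 29: no discriminating edge found, explore with probability delta.
        return random.nextDouble() < delta;
    }

    private static double ratio(double a, double b) {
        return b == 0 ? (a > 0 ? Double.POSITIVE_INFINITY : 0) : a / b;  // guard against division by zero
    }
}
```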
During the initial phase of the execution, the decisions made for the nodes may not be correct. This happens because the number of nodes to compare against is small initially. If the current node is of a type which produces results, it is unlikely that a similar type of entity has been encountered earlier. The heuristic in this case is unable to determine its result-producing capability. To overcome this shortcoming a parameter δ is introduced which allows the exploration of a node even if the heuristic suggests otherwise. This is crucial during the initial part of the execution, when the heuristic fails to predict the usefulness of the node. But in the latter part of the execution, the heuristic can bank on a lot of nodes, which makes its decision very reliable. This suggests that the parameter should vary continuously. It has a high value initially and decreases continuously with each new node that leads to a result.

4.  EXPERIMENTS

The experiments were conducted on a Pentium 4 machine running Windows XP with 1 GB main memory. All the programs were written in Java. The synthetic data used for the simulations was generated with the LUBM benchmark data generator [4]. The LUBM benchmark is basically a university ontology that describes all the constituents of a university, like its faculty, courses, students etc. The synthetic data is represented as a web of linked data with 200,890 nodes denoting entities and 500,595 edges denoting the relationships between them. The efficacy of the proposed idea was demonstrated by executing a set of complex queries on the simulated web of linked data of a university and comparing the results with the existing approach. There are a number of patterns in the data, and each of the queries below demonstrates the ability of the proposed approach to discover and use them to optimize its execution. The results are judged according to the two metrics discussed next.

Metrics. The time taken to execute the query is proportional to the number of URIs resolved to fetch their RDF descriptions during the course of query execution. The time taken to determine whether to explore a node or not is dominated by the amount of time taken to retrieve the data from the web. Therefore, minimizing the number of URIs resolved has a bigger impact on the query execution time. Thus the results are judged according to two metrics. The first metric, α, represents the percentage reduction in the number of URIs dereferenced compared with the existing approach, indicating the degree of optimization achieved. The second
metric, β, denotes the percentage reduction in the completeness of results compared with the existing approach.

let Ne : number of URIs dereferenced by the existing approach
let Re : number of results generated by the existing approach
let Np : number of URIs dereferenced by the proposed approach
let Rp : number of results generated by the proposed approach

    α = ((Ne − Np) / Ne) × 100 ;  β = ((Re − Rp) / Re) × 100
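As a quick check of these definitions, the snippet below recomputes α and β for query Q1 from the counts reported in Table 1 (further below); rounding to whole percentages reproduces the 69% and 6% entries.

```java
// Recomputing the two metrics for Q1 from the Ne, Re, Np, Rp values in Table 1.
public class Metrics {
    static double alpha(long ne, long np) { return (ne - np) * 100.0 / ne; }
    static double beta(long re, long rp)  { return (re - rp) * 100.0 / re; }

    public static void main(String[] args) {
        // Q1 row of Table 1: Ne = 135631, Re = 13911, Np = 42463, Rp = 13076.
        System.out.printf("alpha = %.0f%%, beta = %.0f%%%n",
                alpha(135631, 42463),   // ≈ 68.7%, reported as 69%
                beta(13911, 13076));    // ≈ 6.0%, reported as 6%
    }
}
```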

Queries

Query 1. Return the students in a department who take a graduate course.
The query execution with the existing approach begins by fetching the details of all the students in the department, followed by the details of the courses taken by them. The courses are then filtered according to whether or not they are of type graduate course. The proposed approach optimizes the query by recognizing that most of the graduate courses are taken by students who have an advisor. This is used to avoid retrieving the course details of students who do not have an advisor.

Query 2. Return the students whose advisors have an interest in the research area "Research9".
The existing approach executes the query by fetching the details of all the students, followed by the details of their advisors. The advisors' research interests are further retrieved and checked to see if their title is "Research9". The query is optimized by the proposed approach by building a list of faculty involved in "Research9". Then, as the students' details are fetched, they are checked to see whether their advisors belong to the list involved in "Research9". This avoids the retrieval of data regarding advisors and their research interests not involved in "Research9".

Query 3. Return the students who co-authored a paper with a faculty member.
The query is executed by the existing approach by first fetching the details of all the students, followed by the details of their publications. The publication details reveal the co-authors, whose details are again fetched. If any one of the co-authors is a faculty member, his URI is displayed as a result along with the student's. The proposed approach discovers that most of such faculty are in fact advisors of the students. Therefore, as the student details are fetched, they are parsed to see whether the student has authored a paper. If he has, his advisor is displayed as the co-author. This optimization avoids the retrieval of data pertaining to publications and their co-authors.

Query 4. Return the faculty who teach courses which are attended by the students they advise.
The query is executed by the existing approach by first fetching the details of all the faculty, followed by the details of the courses taught by them. This is followed by fetching the details of all the students who take these courses. Then the students' details are checked to see whether they are advised by the professor who teaches that course. The proposed approach discovers the fact that lecturers teach courses but do not advise students. This fact is used to optimize the query by not fetching the details of the lecturers' courses and of the students who take those courses.

Table 1: Results
  Query      Ne       Re      Np      Rp      α     β
  Q1       135631   13911   42463   13076   69%    6%
  Q2        71714     532   21362     532   71%    0%
  Q3       145788   14316   33321   13170   77%    8%
  Q4        72044     446   29437     446   59%    0%

Figure 4: Degree of Optimization achieved

5.  CONCLUSIONS AND FUTURE WORK

In this paper we present an approach to optimize SPARQL query execution over the web of linked data. The approach is a two-phased strategy to discover patterns in the query. These patterns are used to identify data that do not contribute to answering the query and can be prevented from being fetched, resulting in a reduction in the query execution time. We ran simulations that demonstrated the efficacy of the proposed approach. Future work includes the application of the proposed approach to optimizing conventional SPARQL queries over RDF data.

6.  REFERENCES

[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked data – the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
[2] A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007.
[3] M. Franklin. From databases to dataspaces: A new abstraction for information management. SIGMOD Record, 34:27–33, 2005.
[4] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
[5] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 411–420, New York, NY, USA, 2010. ACM.
[6] O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL queries over the web of linked data. In 8th International Semantic Web Conference (ISWC 2009), October 2009.
[7] Z. G. Ives and N. E. Taylor. Sideways information passing for push-style query processing. In ICDE, pages 774–783. IEEE, 2008.
[8] E. Prud'hommeaux and A. Seaborne. SPARQL query language for RDF. W3C Recommendation, World Wide Web Consortium, 2008.