=Paper=
{{Paper
|id=None
|storemode=property
|title=Optimizing SPARQL Queries over the Web of Linked Data
|pdfUrl=https://ceur-ws.org/Vol-637/paper2.pdf
|volume=Vol-637
}}
==Optimizing SPARQL Queries over the Web of Linked Data==
B. R. Kuldeep Reddy (brkreddy@cse.iitm.ac.in) and P. Sreenivasa Kumar (psk@cse.iitm.ac.in)

Indian Institute of Technology Madras, Chennai, India
ABSTRACT

The web of linked data represents a globally distributed dataspace. It can be queried with SPARQL, whose execution takes place by asynchronously traversing RDF links to discover data sources at run-time. However, the optimization of SPARQL queries over the web of data remains a challenge, and in this paper we present an approach addressing this problem. The proposed approach works in two phases to optimize SPARQL queries. The first phase analyzes the query before its execution and discovers classes of data that do not contribute towards answering it, which can then be prevented from being fetched. However, this analysis may miss a number of patterns that can only be discovered at run-time. The second phase analyzes the query execution to discover more patterns, which are used to further reduce the amount of data fetched from the web to answer the query. The main idea of this phase is to model the query execution as a context graph that is used by a heuristic to discover the patterns, which are passed sideways to prune nodes in the context graph, resulting in an improvement in query performance. The implementation of our approach demonstrates its benefits.

1. INTRODUCTION

Semantic data management includes the various techniques that use data based on its meaning. There has been a recent growth in publishing data on the web following the linked data principles, forming a global web of linked data. Efficient querying of this web of linked data is thus an important component of semantic data management and is addressed in this paper.

The traditional World Wide Web has allowed sharing of documents among users on a global scale. The documents are generally represented in HTML or XML formats and are accessed using URLs and the HTTP protocol, creating a global information space. However, in recent years the web has evolved towards a web of data [3], as the conventional web's data representation sacrifices much of its structure and semantics [1], and the links between documents are not expressive enough to establish the relationship between them. This has led to the emergence of the global data space known as Linked Data [1]. Linked data basically interconnects pieces of data from different sources utilizing the existing web infrastructure. The data published is machine-readable, which means it is explicitly defined. Instead of using HTML, linked data uses the RDF format to represent data. The connection between data is made by typed statements in RDF which clearly define the relationship between them, resulting in a web of data. The Linked Data Principles outlined by Berners-Lee for publishing data on the web basically suggest using URIs as names for things, which are described in RDF format and accessed using the HTTP protocol.

The RDF model describes data in the form of subject, predicate and object triples. The subject and object of a triple can both be URIs that each identify an entity, or a URI and a string value respectively. The predicate denotes the relationship between the subject and object, and is also represented by a URI. SPARQL is the query language proposed by the W3C recommendation to query RDF data [8]. A SPARQL query basically consists of a set of triple patterns. It can have variables in the subject, object or predicate positions of each triple pattern. The solution consists of binding these variables to entities which are related with each other in the RDF model according to the query structure. An execution approach for evaluating SPARQL queries on linked data is proposed in [6]. The idea presented in [5] maintains index structures that are used to select relevant sources for query answering, as in federated query processing. However, an index covering the large amount of data on the Web requires many more resources. According to [6], a SPARQL query is basically executed by iteratively dereferencing URIs to fetch their RDF descriptions from the web and building solutions from the retrieved data. It must have a seed URI as the subject of the first query pattern. It is explained with an example below.

To copy without fee all or part of this material is permitted only for private and academic purposes, given that the title of the publication, the authors and its date of publication appear. Copying or use for commercial purposes, or to republish, to post on servers or to redistribute to lists, is forbidden unless an explicit permission is acquired from the copyright owners; the authors of the material.
Workshop on Semantic Data Management (SemData@VLDB) 2010, September 17, 2010, Singapore.
Copyright 2010: www.semdata.org.

Figure 1: Example SPARQL query
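Figure 1 itself did not survive extraction. Based on the description that follows (friends of Tim-Lee, the entity each friend works for, and that entity's student count) and the predicate labels visible in Figure 3 (nm1:hasFriend, nm1:worksFor, univ:numberOfStudents), the query plausibly looks like the sketch below. Tim-Lee's URI and the exact SELECT form are assumptions, not the paper's actual text.

```python
# Hypothetical reconstruction of the Figure 1 query; Tim-Lee's URI is
# invented, and the predicate names are taken from the Figure 3 labels.
QUERY = """
SELECT ?friend ?entity ?students WHERE {
 <http://example.org/Tim-Lee> nm1:hasFriend ?friend .
 ?friend nm1:worksFor ?entity .
 ?entity univ:numberOfStudents ?students .
}
"""

def triple_patterns(query: str):
    """Naive split of the WHERE block into its triple patterns."""
    body = query.split("{", 1)[1].rsplit("}", 1)[0]
    return [p.strip() for p in body.split(" .") if p.strip()]

patterns = triple_patterns(QUERY)
print(len(patterns))  # the query consists of three triple patterns
```

Each pattern's object variable feeds the next pattern's subject position, which is what lets the executor of [6] traverse the query one dereference at a time from the seed URI.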
Example. The SPARQL query shown in Figure 1 searches for friends of Tim-Lee who work for an entity, and the number of students in that entity. The query execution begins by fetching the RDF description of Tim-Lee by dereferencing his URI. The fetched RDF description is then parsed to gather a list of his friends. Parsing is done by looking for triples that match the first pattern in the query. The object URIs in the matched triples form the list of Tim-Lee's friends. Let us say John, Peter and Mary were found to be his friends. The query execution proceeds by fetching the RDF descriptions corresponding to John, Peter and Mary. Let us say Mary's graph is retrieved first. It is parsed to check for triples matching the second query pattern, and it is found that Mary works for a university. The university's details are again fetched and the third triple pattern in the query is searched in the graph to get the number of students. Peter's and John's graphs and their companies' details would also be retrieved and the query execution would proceed in a way similar to Mary's. But since Peter and John work for corporate entities which do not have students, the third triple pattern is not answered for them.

The SPARQL query shown in Figure 1 can be optimized as follows. Let us say Tim-Lee has friends who are either professors or managers. Suppose Mary is a professor and Peter and John are managers. The professor works for a university and the managers for corporate entities. The third triple pattern in the query can only be answered if the entity is a university, as only a university has students. Fetching the details of managers and the companies they work for does not result in the third triple pattern being answered, and therefore they need not be fetched. The decision on whether to continue the execution for a friend can be taken immediately when his RDF description is fetched, thus avoiding retrieving his company information in case he is a manager. In this example, when John's, Peter's and Mary's data is fetched it is determined that only Mary is a professor, and the execution proceeds only for her, resulting in performance benefits.

The proposed approach works in two phases. The first phase analyzes the query with the help of vocabularies (RDFS or OWL) used to describe the resources involved. This phase traverses the query graph, annotating nodes with the classes of data they can have. It determines the classes of data that do not produce results for the query and thus helps in its optimization. However, it is limited in the type of patterns it can detect and hence calls for an additional run-time phase. The second phase models the query execution as a context graph and employs a run-time heuristic that discovers patterns. Many adaptive techniques have been proposed [2] that operate at run-time. Our method is similar to the sideways information passing technique [7], in which the discovered patterns are passed along sideways during query execution. These patterns are then used to avoid retrieval of data by pruning nodes in the context graph, similar to the branch and bound technique. The heuristic is effective in optimizing the query but it comes at the cost of completeness of results. Some patterns are precise, like "a teaching assistant of a course will not credit it", which does not affect the completeness of results. But others, like "a faculty member who publishes a paper with a student is his advisor", are an approximation and their use will not give complete results.

2. QUERY ANALYSIS PHASE

The query graph is divided into different levels according to its structure. A level is associated with each triple pattern in the query. The level of a node in the query graph can also be viewed as its distance from the root in terms of the number of predicates. Before the query execution can begin, we analyze it in this phase with the help of the vocabularies used to describe the resources (we assume a mapping exists between the vocabularies used; according to [1] it is considered good practice to do so). The idea is to identify classes of data which do not contribute during the query execution, in that they do not produce results. The data corresponding to those classes is deemed redundant and may not be fetched from the web. The vocabulary contains the subclass hierarchy details. This information can be exploited to restrict the query execution to only those classes which generate results.

Figure 2: Example Query Graph

Example. Let us consider the query of Figure 1 represented as a graph in Figure 2. The first triple pattern's predicate is hasFriend, whose vocabulary description reveals that its range is the class of people, who can be either managers or professors. The second triple pattern's predicate is worksFor, whose domain is again the class of people and whose range is the class of entities, found from the vocabulary. The third triple pattern's predicate is numberOfStudents, whose domain is the class of universities. We assume that there exists a mapping which is able to map elements like the professors and universities classes and the worksFor and employs predicates between the two vocabularies in the query. The third node is therefore restricted to only the class of universities. The vocabulary also reveals that universities employ professors. This information is then used to restrict the second node to only the class of professors. Thus, using the vocabulary of the resources involved, we are able to detect the pattern that managers do not work for universities. Usage of this pattern reduces the amount of data fetched from the web, optimizing the query.

Algorithm. The query is represented as a directed graph and the algorithm basically works by traversing it. It begins by dereferencing the URIs of all the predicates in the query. The descriptions of the predicates in the vocabulary give their domains and ranges. The links corresponding to the domains and ranges are also resolved and their details fetched to discover further information regarding their subclasses. This is taken care of in lines 1-4 of Algorithm 1. The query analysis begins at the root's (seed URI) predicate, predicate_0, and proceeds by taking one outgoing path, predicate_i, at a time in a breadth-first manner in lines 7-16. Once a path is taken, its object node is annotated with the classes of its predicate's range. In lines 11-13 all the domain classes of the paths from this object node are noted and their intersection taken. If it is a subclass of the current class, it replaces the current class of the object node in lines 14-15. Now that the node has been restricted, its ancestors are checked in line 16 to see whether they can also be restricted with this new information. For example, suppose the node has been constrained to the class of universities and its parent node has the classes of managers and professors. If there is a schema element which says a university employs professors, then we can restrict its parent to the class of professors. This procedure is repeated till all the predicates are traversed. The final classes annotated on the nodes indicate that the query execution be restricted only to them, resulting in less data being fetched from the web.
Algorithm 1: Query Graph Analysis

Input: Query graph Q
Output: Updated node annotations of Q with classes indicating that the query execution be restricted to them

1   foreach Predicate P of Q do
2       Resolve P to fetch its description R
3       Parse R to find out P_domain and P_range
4       Further fetch & parse P_domain and P_range for their subclass hierarchy
5   list ← null
6   append predicate_0 into list
7   while list is not empty do
8       remove list_head & assign it to p
9       node ← p_object ; node_classes ← p_range
10      domain ← universalSet
11      foreach outgoing predicate_i from node do
12          domain ← domain ∩ predicate_i_domain
13          append predicate_i to list
14      if domain ⊂ p_range then
15          node_classes ← domain
16          restrict node_ancestors
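For illustration, the node-restriction step of Algorithm 1 (lines 10-15) can be sketched in Python. The class names and the subclass table below are invented for the example; they are not part of the paper's vocabularies.

```python
# Sketch of Algorithm 1's node-restriction step (lines 10-15).
# The class names and the subclass table are invented for illustration.
SUBCLASS_OF = {
    "Professor": {"Person"},
    "Manager": {"Person"},
    "University": {"Entity"},
}

def is_subclass(cls: str, ancestor: str) -> bool:
    return ancestor in SUBCLASS_OF.get(cls, set()) or cls == ancestor

def restrict_node(range_classes: set, outgoing_domains: list) -> set:
    """Annotate a node with its predicate's range, then intersect the
    domains of all outgoing predicates; adopt the intersection only if
    every class in it specializes the current annotation."""
    node_classes = set(range_classes)
    domain = None  # stands in for the universal set
    for d in outgoing_domains:
        domain = set(d) if domain is None else domain & set(d)
    if domain is not None and all(
        any(is_subclass(c, r) for r in range_classes) for c in domain
    ):
        node_classes = domain
    return node_classes

# ?entity is annotated with the range {Entity}; its outgoing predicate
# numberOfStudents has domain {University}, so the node is restricted.
print(restrict_node({"Entity"}, [{"University"}]))  # {'University'}
```

This is the step that, in the running example, narrows the ?entity node from all entities to universities before a single company URI is dereferenced.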
Though this phase serves to identify classes of data which do not produce results, it is limited in the patterns it can detect. Suppose we are given a query to find the people in a department who take a graduate course. Following the above algorithm, the range of the first triple pattern's predicate, people, is the class of faculty and students. The domain of the second triple pattern's predicate, takesCourse, is the class of students. Therefore the node connecting both the triples is constrained to the class of students. But the third triple pattern restricts the type of courses to graduate courses. Since we do not have information in the vocabulary which says that graduate courses are generally only taken by students who have advisors, we are unable to further constrain the node identifying the people from students to only students who have advisors. This calls for a run-time approach which analyzes the query execution to discover patterns which are missed by the query analysis phase.

3. QUERY EXECUTION PHASE

3.1 Data-Structures

3.1.1 Context Graph

This phase models the query execution as a context graph G. The context graph is built by adding nodes to it containing the RDF descriptions of the entities fetched from the web. A node is added as a child to the one containing the RDF description that led to it being fetched. The edge between the two nodes in G is labeled with the predicate name denoting the relationship between the two entities in them. The root is the seed which begins the execution of the query. The results occur at the leaves whose path to the root matches their corresponding sequence of predicate terms in the query. The context graph is also divided into different levels, similar to the query graph. The context graph contains bindings of the nodes in the query graph. If the subject binding of a triple pattern in the context graph belongs to level l, the object binding belongs to level l + 1. The contents of a node of the context graph are the description of the entity in the shape of a star graph. The center represents the entity being described and the edges denote its properties and their values. The context graph also acts as a cache. This allows us to check whether a resource's details have already been fetched earlier before an attempt is made to retrieve it from the web, which further improves the query performance.

Figure 3: Context Graph

Example. An example of a Context Graph is shown in Figure 3, which models the execution of the query in Figure 1. According to the query, Tim-Lee's URI is resolved to fetch his RDF description from the web, and it is stored in the corresponding node in the context graph. The context graph is built by creating nodes for Peter, Mary and John after fetching their details. It continues till the data for the last query pattern is retrieved. The result in this case is the node containing the value 10000 and its path to the root.
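A minimal sketch of the context-graph structure just described, with a node cache keyed by URI, could look as follows. The class layout and field names are our own, not the paper's implementation.

```python
# Minimal sketch of the context graph of Section 3.1.1.  Each node
# stores an entity's description as a star graph (predicate -> object),
# and the graph doubles as a cache keyed by URI.
class ContextNode:
    def __init__(self, uri, description, level):
        self.uri = uri
        self.description = description  # star graph: {predicate: object}
        self.level = level
        self.children = {}              # predicate label -> child node

class ContextGraph:
    def __init__(self, seed_uri, seed_description):
        self.root = ContextNode(seed_uri, seed_description, level=0)
        self.cache = {seed_uri: self.root}

    def add_child(self, parent, predicate, uri, description):
        """Add a node for a newly dereferenced URI; reuse cached data
        instead of re-fetching a resource seen earlier."""
        if uri in self.cache:
            node = self.cache[uri]
        else:
            node = ContextNode(uri, description, parent.level + 1)
            self.cache[uri] = node
        parent.children[predicate] = node
        return node

g = ContextGraph("ex:Tim-Lee", {"rdf:type": "ex:Person"})
mary = g.add_child(g.root, "nm1:hasFriend", "ex:Mary",
                   {"rdf:type": "univ:Professor"})
print(mary.level)  # 1: object bindings sit one level below their subject
```

The cache lookup in add_child is what lets the executor skip a web round-trip when the same URI reappears along a different path.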
3.1.2 Summaries

Summaries of context nodes are maintained, and are used by the heuristic to decide whether or not to explore the current node of the context graph. Two summaries are associated with each node of the query graph. One summary, S_results^qnode, summarizes the information of all the result-producing context nodes associated with qnode, and the other, S_noResults^qnode, summarizes the information of all the nodes that did not lead to results for qnode. A summary is basically a star graph whose center represents a generic result-producing or non-result-producing entity and whose edges denote the properties. The edges have a weight indicating the number of times that the triple containing that predicate and object has occurred in the nodes previously fetched. An edge with a high weight in the result-producing summary and a low weight in the non-result-producing summary can be used as a discriminant. If a node contains such an edge then it is highly probable that it will produce results, as it is similar to the earlier nodes that did produce results. Summaries at any point of time contain only the top-K high-weight triples.

Algorithm. When a URI is resolved to fetch an RDF description from the web, it is inserted in the context graph as a new node N. Let us say it is associated with node qn of the query graph, which has the two summaries associated with it. If the execution continues for the new node and it produces results, its details are inserted in S_results^qn; else they are inserted in S_noResults^qn. The insertion procedure is as follows. Each triple in the new node N is compared with the triples in one of the summaries, chosen based on whether it produced results. If both the predicate name and the object match, the weight of the corresponding edge is incremented by one. Otherwise, if the number of triples in the summary is less than K, it is inserted as a new triple and its weight is set to one. On the other hand, if the number of triples is already K, a triple with the least weight is randomly selected and replaced. This executes for all the nodes constituting the path from N to the leaf, updating the summaries at all the levels with the corresponding details of the nodes.
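The top-K weighted-edge update described above can be sketched as follows. K and the dictionary layout are illustrative; the paper does not fix them, and the eviction below picks one least-weight triple deterministically where the paper says "randomly".

```python
K = 3  # summaries keep only the top-K weighted triples (K is illustrative)

def update_summary(summary, node_triples):
    """Insert a fetched node's (predicate, object) pairs into a summary,
    incrementing weights on a match and evicting a least-weight entry
    once the summary already holds K triples (Section 3.1.2)."""
    for edge in node_triples:          # edge = (predicate, object)
        if edge in summary:
            summary[edge] += 1
        elif len(summary) < K:
            summary[edge] = 1
        else:                          # full: replace a least-weight triple
            lightest = min(summary, key=summary.get)
            del summary[lightest]
            summary[edge] = 1
    return summary

s_results = {}
update_summary(s_results, [("rdf:type", "univ:Professor")])
update_summary(s_results, [("rdf:type", "univ:Professor")])
print(s_results)  # {('rdf:type', 'univ:Professor'): 2}
```

After a few result-producing professors have been seen, the ("rdf:type", "univ:Professor") edge dominates S_results and becomes the discriminant used in Section 3.2.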
3.1.3 Links

The query graph may have a number of nodes which have relationships amongst them other than those specified by the query. These relationships are discovered by analyzing the execution of the query. Their discovery improves the query performance by introducing an edge denoting the relationship between the query nodes, bypassing the original edges and nodes if none of the bypassed nodes need to be displayed. Therefore, the data corresponding to the bypassed nodes need not be fetched, thereby reducing the query execution time.

Example. Suppose we have a query which finds all the students who have authored a paper with a faculty member. In the query graph, the nodes denoting students and faculty have a node representing publications in between. The query is executed by fetching the details of students, followed by their publication details, followed by the publications' co-authors, to find a faculty member. During its execution we find that such a faculty member is generally an advisor of the student. Therefore, using this pattern, we can check whether the student has a publication and, if he does, display the advisor without fetching the details of the publications.

Summaries by themselves are unable to discover these relationships, which calls for additional data structures. Therefore two link structures are maintained between each pair of nodes, say n1 and n2, in the given query graph. The two types of link structures between the pair of nodes in the query graph at level l are link_results^n1-n2 and link_noResults^n1-n2. They are again star graphs, with the center indicating a generic result- or non-result-producing node with weighted predicates attached to it. They are maintained as follows. If the result for a query is produced, the predicate denoting the relationship between the node bindings of n1 and n2 is inserted in the star graph link_results^n1-n2 and its weight set to 1. If the predicate is already present, its weight is incremented by one. If the result for the query is not produced, link_noResults^n1-n2 is similarly updated. This process is carried out for all pairs of nodes in the query graph. The bypassing of nodes under certain circumstances may not be correct, as it may lead to results which would not have passed through the bypassed nodes. Therefore, to deal with this situation, when a result is generated after bypassing we randomly check whether it would also have been generated by the un-bypassed original nodes. A variable is maintained which continuously computes the percentage difference of produced results between these two situations. If it falls below a certain threshold, we say that the bypassing of nodes is not correct and we drop it.

3.2 Heuristic

We can optimize the SPARQL query by halting the exploration of certain nodes in the context graph which do not produce results. This requires the formulation of a heuristic which makes such a decision. This is similar to the Branch & Bound strategy, using a heuristic to prune nodes in the context graph. A reduced number of nodes corresponds to a smaller number of URIs being dereferenced to fetch their RDF descriptions, and thus less time taken to execute the query. The heuristic basically works by matching the contents of a node with the contents of the summaries and links associated with all nodes of the query graph. The summaries are compared to discover edges which have a high weight in either one of them. The presence of such edges in the node is used to assess its likelihood of producing results. Links are used to discover relationships between query nodes not specified in the query graph.

Example. For the query in Figure 1, suppose Tim-Lee has another friend, Richard, who is a professor in some university. When his RDF description is fetched, a node is created for him in the context graph of Figure 3 at level 1. In order to decide whether to continue exploring with the current node, it is matched with the two summaries associated with the query node "?friend" at the same level. Comparison of the two summaries indicates that result-producing nodes contain the triple defining the entity to be of type Professor, and non-result-producing nodes contain the triple defining the entity to be of type Manager. Richard's description is found to contain the triple describing him as a professor, and therefore it is predicted that his node will generate results.

Algorithm. The heuristic is invoked each time a new node N is inserted into the context graph. The contents of the new node are a description of an entity in the form of a star graph. Lines 2-11 in Algorithm 2 iterate over all
the nodes in the query graph. The idea here is to find whether N contains a predicate which denotes a direct relationship between nodes in the query graph, allowing us to bypass the original nodes between them. The heuristic iterates over all the predicates in N in line 3 and counts their matches in link_results and link_noResults, storing the scores in the variables scoreLr and scoreLnR respectively in lines 6 and 9. If the ratio of scoreLr to scoreLnR is greater than the threshold, we introduce a direct edge between the two nodes and exploration proceeds with this new edge in lines 10-11.

Lines 12-29 in Algorithm 2 iterate over all the triples in N. Each triple in N is compared with the triples in the two summaries associated with its corresponding query node n1 in the query graph. This happens in lines 14-24 of Algorithm 2. Two score variables are maintained for each summary. scorePO counts the predicate and object value matches, like "x advisorOf studentA", whereas scoreP counts the predicate matches, like "advisorOf", between the node's and the summary's contents. Lines 17-18 and 23-24 do the comparisons and increment the scores by the weights of the respective edges. Ratios of the scores computed from the result summaries to those computed from the non-result summaries are determined. These ratios indicate the presence of edges which are predominant in S_results^n but rare in S_noResults^n. If either of the ratios is above a certain threshold in lines 25-26, we can say that N contains a pattern commonly found in the result-producing nodes of its level. Hence, it is predicted to produce results and the execution proceeds from the new node. The inverses of the ratios computed earlier are also compared with the threshold in lines 27-28; if they are greater, it means that N is very similar to the non-result-producing nodes and it may also not produce results. The parameter δ represents the probability with which a node may be explored after being rejected by the heuristic. When the node is found to be similar to the non-result-producing nodes, the probability of it being explored is diminished by a constant e in line 28.

Algorithm 2: Heuristic

Input: Context graph and Query graph, New node N
Output: Decision on whether to explore N

1   New node N associated with node n1 in query graph; foreach node n2 in query graph do
2       foreach predicate prd in N do
3           scoreLr ← 0 ; scoreLnR ← 0
4           foreach Triple Tr ∈ link_results^n1-n2 do
5               if Tr_predicate = prd then
6                   scoreLr ← Tr_weight
7           foreach Triple Tnr ∈ link_noResults^n1-n2 do
8               if Tnr_predicate = prd then
9                   scoreLnR ← Tnr_weight
10          if scoreLr / scoreLnR >= RATIO then
11              Explore N with the next predicate prd instead of the original path ; exit
12  foreach Triple T in N do
13      scorePO_results ← 0 ; scoreP_results ← 0
14      foreach Triple Tr ∈ S_results^n1 do
15          if Tr_predicate = T_predicate then
16              if Tr_object = T_object then
17                  scorePO_results ← Tr_weight
18              scoreP_results ← scoreP_results + Tr_weight
19      scorePO_noResults ← 0 ; scoreP_noResults ← 0
20      foreach Triple Tnr ∈ S_noResults^n1 do
21          if Tnr_predicate = T_predicate then
22              if Tnr_object = T_object then
23                  scorePO_noResults ← Tnr_weight
24              scoreP_noResults ← scoreP_noResults + Tnr_weight
25      if scorePO_results / scorePO_noResults >= RATIO or scoreP_results / scoreP_noResults >= RATIO then
26          Explore N ; exit
27      if scorePO_noResults / scorePO_results >= RATIO or scoreP_noResults / scoreP_results >= RATIO then
28          With probability δ/e Explore N and exit
29  Explore N with probability δ

During the initial phase of the execution, the decisions made for the nodes may not be correct. This happens because the number of nodes to be compared against is small initially. If the current node is of a type which produces results, it is unlikely that a similar type of entity has been encountered earlier. The heuristic in this case is unable to determine its result-producing capability. To overcome this shortcoming, a parameter δ is introduced which allows the exploration of a node even if the heuristic suggests otherwise. This is crucial during the initial part of the execution, when the heuristic fails to predict the usefulness of the node. But in the latter part of the execution, the heuristic can bank on a lot of nodes, which makes its decision very reliable. This suggests that the parameter should vary continuously: it has a high value initially and decreases continuously with each new node that leads to a result.
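For concreteness, the summary-matching test of Algorithm 2 (lines 12-26) can be sketched roughly as below. RATIO, the summary layout, and the accumulation of scorePO over multiple matches are illustrative assumptions, not the paper's settings.

```python
RATIO = 2.0  # illustrative threshold; the paper does not give its value

def summary_scores(node_triples, summary):
    """scorePO counts predicate+object matches, scoreP predicate-only
    matches, each weighted by the summary edge weights (Alg. 2, l. 14-24)."""
    score_po = score_p = 0
    for (pred, obj) in node_triples:
        for (s_pred, s_obj), weight in summary.items():
            if s_pred == pred:
                score_p += weight
                if s_obj == obj:
                    score_po += weight
    return score_po, score_p

def predict_explore(node_triples, s_results, s_no_results):
    """Explore N if its result-summary scores dominate the non-result
    scores by at least RATIO (Alg. 2, l. 25-26)."""
    po_r, p_r = summary_scores(node_triples, s_results)
    po_n, p_n = summary_scores(node_triples, s_no_results)
    eps = 1e-9  # avoid division by zero for empty summaries
    return po_r / (po_n + eps) >= RATIO or p_r / (p_n + eps) >= RATIO

# Richard's node from the Section 3.2 example: his Professor triple is
# heavy in the result summary, so exploration is predicted to pay off.
richard = [("rdf:type", "univ:Professor")]
s_res = {("rdf:type", "univ:Professor"): 5}
s_nores = {("rdf:type", "univ:Manager"): 4}
print(predict_explore(richard, s_res, s_nores))  # True
```

A full implementation would wrap this test with the link-based bypass check (lines 1-11) and the δ-probability escape hatch of lines 28-29.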
4. EXPERIMENTS

The experiments were conducted on a Pentium 4 machine running Windows XP with 1 GB main memory. All the programs were written in Java. The synthetic data used for the simulations was generated with the LUBM benchmark data generator [4]. The LUBM benchmark is basically a university ontology that describes all the constituents of a university, like its faculty, courses, students etc. The synthetic data is represented as a web of linked data with 200,890 nodes denoting entities and 500,595 edges denoting the relationships between them. The efficacy of the proposed idea was demonstrated by executing a set of complex queries on the simulated web of linked data of a university and comparing the results with the existing approach. There are a number of patterns in the data, and each of the queries below demonstrates the ability of the proposed approach to discover and use them to optimize its execution. The results are judged according to the two metrics discussed next.

Metrics. The time taken to execute the query is proportional to the number of URIs resolved to fetch their RDF descriptions during the course of query execution. The time taken to determine whether to explore a node or not is dominated by the amount of time taken to retrieve the data from the web. Therefore, minimizing the number of URIs resolved has a bigger impact on the query execution time. Thus the results are judged according to two metrics. The first metric, α, represents the percentage reduction in the number of URIs dereferenced compared with the existing approach, indicating the degree of optimization achieved. The second metric, β, denotes the percentage reduction in the completeness of results compared with the existing approach.

Let Ne: number of URIs dereferenced by the existing approach
Let Re: number of results generated by the existing approach
Let Np: number of URIs dereferenced by the proposed approach
Let Rp: number of results generated by the proposed approach

α = ((Ne − Np) / Ne) ∗ 100 ; β = ((Re − Rp) / Re) ∗ 100

Table 1: Results

Query   Ne       Re      Np      Rp      α     β
Q1      135631   13911   42463   13076   69%   6%
Q2      71714    532     21362   532     71%   0%
Q3      145788   14316   33321   13170   77%   8%
Q4      72044    446     29437   446     59%   0%
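As a quick sanity check, the two metrics can be computed directly from the Table 1 counts:

```python
def alpha(ne, np_):
    """Percentage reduction in URIs dereferenced: (Ne - Np) / Ne * 100."""
    return (ne - np_) / ne * 100

def beta(re, rp):
    """Percentage reduction in result completeness: (Re - Rp) / Re * 100."""
    return (re - rp) / re * 100

# Q1 from Table 1: Ne = 135631, Re = 13911, Np = 42463, Rp = 13076
print(round(alpha(135631, 42463)))  # 69
print(round(beta(13911, 13076)))    # 6
```

For Q2 and Q4, β is 0%: the link-based pruning happened to remove only nodes that never contributed a result.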
Queries

Query 1. Return the students in a department who take a graduate course.
The query execution with the existing approach begins by fetching the details of all the students in the department, followed by the details of the courses taken by them. The courses are then filtered according to whether or not they are of the type graduate course. The proposed approach optimizes the query by recognizing that most of the graduate courses are taken by students who have an advisor. This is used to avoid retrieving the course details of students who do not have an advisor.
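The discovered pattern for Query 1 amounts to a cheap pre-filter on each student's already-fetched star graph before any course URI is dereferenced. The predicate names hasAdvisor and takesCourse are assumed for illustration; the paper does not spell out the vocabulary.

```python
# Sketch of the Query 1 optimization: skip dereferencing course URIs for
# students whose description lacks an advisor edge.  The predicate names
# "hasAdvisor" and "takesCourse" are assumed for illustration.
def courses_to_fetch(students):
    """students: list of star graphs, e.g.
    {"hasAdvisor": "...", "takesCourse": [...]}."""
    uris = []
    for desc in students:
        if "hasAdvisor" in desc:   # pattern: advisees take graduate courses
            uris.extend(desc.get("takesCourse", []))
    return uris

students = [
    {"hasAdvisor": "ex:Prof1", "takesCourse": ["ex:GradCourse7"]},
    {"takesCourse": ["ex:Course2"]},  # no advisor: pruned, never fetched
]
print(courses_to_fetch(students))  # ['ex:GradCourse7']
```

Because the pattern is approximate, pruning this way is what produces the 6% completeness loss (β) reported for Q1 in Table 1.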
Query 2. Return the students whose advisors have an interest in the research area "Research9".
The existing approach executes the query by fetching the details of all the students, followed by the details of their advisors. The advisors' research interests are further retrieved and checked to see if their title is "Research9". The query is optimized by the proposed approach by building a list of the faculty involved in "Research9". Then, as the students' details are fetched, they are checked to see if their advisors belong to the list involved in "Research9". This avoids the retrieval of data regarding advisors, and their research interests, not involved in "Research9".

Query 3. Return the students who co-authored a paper with a faculty member.
The query is executed by the existing approach by first fetching the details of all the students, followed by the details of their publications. The publication details reveal the co-authors, whose details are again fetched. If any one of the co-authors is a faculty member, his URI is displayed as a result along with the student's. The proposed approach discovers that most of such faculty are in fact advisors of the students. Therefore, as the student details are fetched, they are parsed to see if the student authored a paper. If he did, his advisor is displayed as the co-author. This optimization avoids the retrieval of data pertaining to publications and their co-authors.

Query 4. Return the faculty who take courses which are attended by the students they advise.
The query is executed by the existing approach by first fetching the details of all the faculty, followed by the details of the courses taken by them. This is followed by fetching the details of all the students who take these courses. Then the students' details are checked to see whether they are guided by the professor who teaches that course. The proposed approach discovers the fact that lecturers take courses but do not advise students. This fact is used to optimize the query by not fetching the details of the lecturers' courses and the students who take those courses.

Figure 4: Degree of Optimization achieved

5. CONCLUSIONS AND FUTURE WORK

In this paper we present an approach to optimize SPARQL query execution over the web of linked data. The approach is a two-phased strategy to discover patterns in the query. These patterns are used to identify data that does not contribute to answering the query and can be prevented from being fetched, resulting in a reduction in the query execution time. We ran simulations that demonstrated the efficacy of the proposed approach. Future work includes the application of the proposed approach to optimize conventional SPARQL queries over RDF data.

6. REFERENCES

[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked data – the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
[2] A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007.
[3] M. Franklin. From databases to dataspaces: A new abstraction for information management. SIGMOD Record, 34:27–33, 2005.
[4] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
[5] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 411–420, New York, NY, USA, 2010. ACM.
[6] O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL queries over the web of linked data. In 8th International Semantic Web Conference (ISWC 2009), October 2009.
[7] Z. G. Ives and N. E. Taylor. Sideways information passing for push-style query processing. In ICDE, pages 774–783. IEEE, 2008.
[8] E. Prud'hommeaux and A. Seaborne. SPARQL query language for RDF. W3C Recommendation, World Wide Web Consortium, 2008.