Optimizing SPARQL queries over the Web of Linked Data

B. R. Kuldeep Reddy and P. Sreenivasa Kumar
Indian Institute of Technology Madras, Chennai, India
brkreddy@cse.iitm.ac.in, psk@cse.iitm.ac.in

ABSTRACT

The web of linked data represents a globally distributed dataspace. It can be queried with SPARQL, whose execution takes place by asynchronously traversing RDF links to discover data sources at run-time. However, the optimization of SPARQL queries over the web of data remains a challenge, and in this paper we present an approach addressing this problem. The proposed approach works in two phases to optimize SPARQL queries. The first phase analyzes the query before its execution and discovers classes of data that do not contribute towards answering it, which can then be prevented from being fetched. However, this analysis may miss a number of patterns that can only be discovered at run-time. The second phase analyzes the query execution to discover more patterns, which are used to further reduce the amount of data fetched from the web to answer the query. The main idea of this phase is to model the query execution as a context graph that is used by a heuristic to discover patterns; these are passed sideways to prune nodes in the context graph, resulting in an improvement in query performance. The implementation of our approach demonstrates its benefits.

1. INTRODUCTION

Semantic data management includes the various techniques that use data based on its meaning. There has been a recent growth in publishing data on the web following the linked data principles, forming a global web of linked data. Efficient querying of this web of linked data is thus an important component of semantic data management and is addressed in this paper.

The traditional World Wide Web has allowed sharing of documents among users on a global scale. The documents are generally represented in HTML or XML formats and are accessed using URLs and the HTTP protocol, creating a global information space. However, in recent years the web has evolved towards a web of data [3], as the conventional web's data representation sacrifices much of its structure and semantics [1], and the links between documents are not expressive enough to establish the relationships between them. This has led to the emergence of the global data space known as Linked Data [1]. Linked data basically interconnects pieces of data from different sources utilizing the existing web infrastructure. The data published is machine-readable, which means it is explicitly defined. Instead of HTML, linked data uses the RDF format to represent data. The connections between data are made by typed statements in RDF which clearly define the relationships between them, resulting in a web of data. The Linked Data Principles outlined by Berners-Lee for publishing data on the web basically suggest using URIs as names for things, which are described in RDF format and accessed using the HTTP protocol.

The RDF model describes data in the form of subject, predicate and object triples. The subject and object of a triple can both be URIs that each identify an entity, or a URI and a string value respectively. The predicate denotes the relationship between the subject and object, and is also represented by a URI. SPARQL is the query language recommended by the W3C to query RDF data [8]. A SPARQL query basically consists of a set of triple patterns. It can have variables in the subject, object or predicate positions of each triple pattern. The solution consists of binding these variables to entities which are related to each other in the RDF model according to the query structure.

An execution approach for evaluating SPARQL queries on linked data is proposed in [6]. The idea presented in [5] maintains index structures that are used to select relevant sources for query answering, as in federated query processing; however, an index covering a large amount of data on the Web requires considerably more resources. According to [6], a SPARQL query is executed by iteratively dereferencing URIs to fetch their RDF descriptions from the web and building solutions from the retrieved data. It must have a seed URI as the subject of the first query pattern. This is explained with an example below.

Figure 1: Example SPARQL query
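Figure 1 itself is not reproduced in this text version. As a rough, hypothetical reconstruction (the predicate names hasFriend, worksFor and numberOfStudents are those discussed later in the paper; the nm1:/univ: prefixes and the seed URI are placeholders of our own), the query's three triple patterns can be written down as plain data, for instance in Python:

# Hypothetical reconstruction of the Figure 1 query as (subject, predicate, object)
# triple patterns. Strings starting with "?" are variables; the seed URI is a placeholder.
SEED = "ex:TimBernersLee"

QUERY_PATTERNS = [
    (SEED,      "nm1:hasFriend",         "?friend"),
    ("?friend", "nm1:worksFor",          "?entity"),
    ("?entity", "univ:numberOfStudents", "?number"),
]

for pattern in QUERY_PATTERNS:
    print(pattern)

Each pattern's subject is either the seed URI or a variable bound by the previous pattern, which is what makes the traversal-based execution described next possible.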
Example. The SPARQL query shown in Figure 1 searches for friends of Tim-Lee who work for an entity, and for the number of students in that entity. The query execution begins by fetching the RDF description of Tim-Lee by dereferencing his URI. The fetched RDF description is then parsed to gather a list of his friends. Parsing is done by looking for triples that match the first pattern in the query. The object URIs in the matched triples form the list of Tim-Lee's friends. Let's say John, Peter and Mary were found to be his friends. The query execution proceeds by fetching the RDF descriptions corresponding to John, Peter and Mary. Suppose Mary's graph is retrieved first. It is parsed to check for triples matching the second query pattern, and it is found that Mary works for a university. The university's details are then fetched and the third triple pattern in the query is searched for in that graph to get the number of students. Peter's and John's graphs and their companies' details would also be retrieved and the query execution would proceed in a way similar to Mary's. But since Peter and John work for corporate entities which do not have students, the third triple pattern is not answered for them.
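A minimal Python sketch of this link-traversal style of execution is given below. A small in-memory dictionary stands in for dereferencing URIs over HTTP; all URIs and data in it are illustrative and not taken from any real dataset.

# Toy stand-in for the web of linked data: URI -> triples obtained by dereferencing it.
WEB = {
    "ex:TimBernersLee": [("ex:TimBernersLee", "nm1:hasFriend", "ex:John"),
                         ("ex:TimBernersLee", "nm1:hasFriend", "ex:Peter"),
                         ("ex:TimBernersLee", "nm1:hasFriend", "ex:Mary")],
    "ex:John":  [("ex:John", "nm1:worksFor", "ex:CorpA")],
    "ex:Peter": [("ex:Peter", "nm1:worksFor", "ex:CorpB")],
    "ex:Mary":  [("ex:Mary", "nm1:worksFor", "ex:Univ1")],
    "ex:CorpA": [],
    "ex:CorpB": [],
    "ex:Univ1": [("ex:Univ1", "univ:numberOfStudents", "10000")],
}

PATTERNS = [("ex:TimBernersLee", "nm1:hasFriend", "?friend"),
            ("?friend", "nm1:worksFor", "?entity"),
            ("?entity", "univ:numberOfStudents", "?number")]

def dereference(uri):
    """Stand-in for fetching and parsing the RDF description behind a URI."""
    return WEB.get(uri, [])

def execute(patterns):
    """Naive link-traversal evaluation: match one pattern per level, follow object URIs."""
    solutions = [{}]                              # partial variable bindings
    for subj, pred, obj in patterns:              # one triple pattern per level
        next_solutions = []
        for binding in solutions:
            subject = binding.get(subj, subj)     # substitute an already-bound variable
            for _s, p, o in dereference(subject):
                if p == pred:
                    next_solutions.append({**binding, obj: o})
        solutions = next_solutions
    return solutions

print(execute(PATTERNS))
# -> [{'?friend': 'ex:Mary', '?entity': 'ex:Univ1', '?number': '10000'}]

Note that the descriptions of the two companies are still fetched here even though they can never satisfy the third pattern; avoiding exactly this kind of wasted retrieval is what the rest of the paper addresses.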
The SPARQL query shown in Figure 1 can be optimized as follows. Let's say Tim-Lee has friends who are either professors or managers. Suppose Mary is a professor while Peter and John are managers. Professors work for a university and managers for a corporate entity. The third triple pattern in the query can only be answered if the entity is a university, as only a university has students. Fetching the details of the managers and the companies they work for does not result in the third triple pattern being answered, and therefore they need not be fetched. The decision on whether to continue the execution for a friend can be taken immediately when his RDF description is fetched, thus avoiding the retrieval of his company's information in case he is a manager. In this example, when John's, Peter's and Mary's data is fetched it is determined that only Mary is a professor, and the execution proceeds only for her, resulting in performance benefits.

The proposed approach works in two phases. The first phase analyzes the query with the help of the vocabularies (RDFS or OWL) used to describe the resources involved. This phase traverses the query graph, annotating nodes with the classes of data they can have. It determines the classes of data that do not produce results for the query and thus helps in its optimization. However, it is limited in the type of patterns it can detect and hence calls for an additional run-time phase. The second phase models the query execution as a context graph and employs a run-time heuristic that discovers patterns. Many adaptive techniques have been proposed [2] that operate at run-time. Our method is similar to the sideways information passing technique [7] in that the discovered patterns are passed along sideways during query execution. These patterns are then used to avoid the retrieval of data by pruning nodes in the context graph, similar to the branch and bound technique. The heuristic is effective in optimizing the query, but it comes at the cost of completeness of results. Some patterns are precise, such as "a teaching assistant of a course will not take that course for credit", and do not affect the completeness of results. Others, such as "a faculty member who publishes a paper with a student is the student's advisor", are an approximation, and their use will not give complete results.

2. QUERY ANALYSIS PHASE

The query graph is divided into different levels according to its structure. A level is associated with each triple pattern in the query. The level of a node in the query graph can also be viewed as its distance from the root in terms of the number of predicates. Before the query execution can begin, we analyze it in this phase with the help of the vocabularies used to describe the resources (we assume a mapping exists that maps elements between the vocabularies involved; according to [1] it is considered good practice to do so). The idea is to identify classes of data which do not contribute during the query execution, in that they do not produce results. The data corresponding to those classes is deemed redundant and need not be fetched from the web. The vocabulary contains the subclass hierarchy details. This information can be exploited to restrict the query execution to only those classes which generate results.
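As a small illustration of the level assignment, the sketch below computes the level of each query-graph node as its predicate distance from the root; the graph structure and names mirror the Figure 1 query and are illustrative.

from collections import deque

# Query graph as adjacency lists: node -> list of (predicate, child node).
QUERY_GRAPH = {
    "seed":    [("nm1:hasFriend", "?friend")],
    "?friend": [("nm1:worksFor", "?entity")],
    "?entity": [("univ:numberOfStudents", "?number")],
    "?number": [],
}

def assign_levels(graph, root="seed"):
    """Breadth-first traversal: a node's level is its distance from the root in predicates."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _predicate, child in graph[node]:
            if child not in levels:
                levels[child] = levels[node] + 1
                queue.append(child)
    return levels

print(assign_levels(QUERY_GRAPH))   # {'seed': 0, '?friend': 1, '?entity': 2, '?number': 3}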
Figure 2: Example Query Graph

Example. Let us consider the query of Figure 1 represented as a graph in Figure 2. The first triple pattern's predicate is hasFriend, whose vocabulary description reveals that its range is the class of people, who can be either managers or professors. The second triple pattern's predicate is worksFor, whose domain is again the class of people and whose range is the class of entities, as found from the vocabulary. The third triple pattern's predicate is numberOfStudents, whose domain is the class of universities. We assume that there exists a mapping which is able to map elements such as the professor and university classes, and the worksFor and employs predicates, between the two vocabularies used in the query. The third node is therefore restricted to only the class of universities. The vocabulary also reveals that universities employ professors. This information is then used to restrict the second node to only the class of professors. Thus, using the vocabularies of the resources involved, we are able to detect the pattern that managers do not work for universities. Using this pattern reduces the amount of data fetched from the web, optimizing the query.

Algorithm. The query is represented as a directed graph and the algorithm basically works by traversing it. It begins by dereferencing the URIs of all the predicates in the query. The descriptions of the predicates in the vocabulary give their domains and ranges. The links corresponding to the domains and ranges are also resolved and their details fetched to discover further information regarding their subclasses. This is taken care of in lines 1-4 of Algorithm 1. The query analysis begins at the root's (seed URI's) predicate, predicate_0, and proceeds by taking one outgoing path, predicate_i, at a time in a breadth-first manner in lines 7-16. Once a path is taken, its object node is annotated with the classes of its predicate's range. In lines 11-13 all the domain classes of the paths leaving this object node are noted and their intersection is taken. If it is a subclass of the current class, it replaces the current class of the object node in lines 14-15. Now that the node has been restricted, its ancestors are checked in line 16 to see whether they can also be restricted with this new information. For example, suppose a node has been constrained to the class of universities and its parent node has the classes of managers and professors; if there is a schema element which says a university employs professors, then we can restrict the parent to the class of professors. This procedure is repeated till all the predicates are traversed. The final classes annotated on the nodes indicate that the query execution should be restricted only to them, resulting in less data being fetched from the web.

Algorithm 1: Query Graph Analysis
Input: Query graph Q
Output: Updated node annotations of Q with classes indicating that the query execution be restricted to them
1   foreach predicate P of Q do
2       resolve P to fetch its description R
3       parse R to find P_domain and P_range
4       further fetch and parse P_domain and P_range for their subclass hierarchies
5   list <- null
6   append predicate_0 to list
7   while list is not empty do
8       remove the head of list and assign it to p
9       node <- p_object ; node_classes <- p_range
10      domain <- universalSet
11      foreach outgoing predicate_i from node do
12          domain <- domain ∩ predicate_i_domain
13          append predicate_i to list
14      if domain ⊂ p_range then
15          node_classes <- domain
16          restrict node_ancestors
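The following is a simplified sketch of this restriction step, not a full implementation of Algorithm 1: the vocabulary facts (predicate domains and ranges, and "university employs professor" style schema knowledge) are hard-coded as toy dictionaries rather than fetched and parsed from the web, and subclass hierarchies are flattened into plain class sets.

# Toy vocabulary knowledge; the real system obtains this by dereferencing
# the predicate URIs and parsing their RDFS/OWL descriptions.
RANGE  = {"nm1:hasFriend": {"Manager", "Professor"},
          "nm1:worksFor":  {"Company", "University"}}
DOMAIN = {"nm1:worksFor":  {"Manager", "Professor"},
          "univ:numberOfStudents": {"University"}}
EMPLOYS = {"University": {"Professor"}, "Company": {"Manager"}}   # illustrative schema facts

# Query graph: node -> (incoming predicate, outgoing predicates, parent node).
QUERY = {
    "?friend": ("nm1:hasFriend", ["nm1:worksFor"], None),
    "?entity": ("nm1:worksFor", ["univ:numberOfStudents"], "?friend"),
}

def analyse(query):
    classes = {}
    for node, (incoming, outgoing, _parent) in query.items():
        allowed = set(RANGE[incoming])                 # start from the predicate's range
        for predicate in outgoing:                     # intersect with outgoing domains
            allowed &= DOMAIN.get(predicate, allowed)
        classes[node] = allowed
    for node, (_incoming, _outgoing, parent) in query.items():
        if parent:                                     # backward step: restrict the parent
            employed = set().union(*(EMPLOYS.get(c, set()) for c in classes[node]))
            if employed:
                classes[parent] &= employed
    return classes

print(analyse(QUERY))   # {'?friend': {'Professor'}, '?entity': {'University'}}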
Though this phase serves to identify classes of data which do not produce results, it is limited in the patterns it can detect. Suppose we are given a query to find the people in a department who take a graduate course. Following the above algorithm, the range of the first triple pattern's predicate is the class of people, which covers both faculty and students. The domain of the second triple pattern's predicate, takesCourse, is the class of students. Therefore the node connecting the two triples is constrained to the class of students. The third triple pattern further restricts the type of courses to graduate courses. Since we do not have information in the vocabulary saying that graduate courses are generally taken only by students who have advisors, we are unable to further constrain the node identifying the people from students to only students who have advisors. This calls for a run-time approach which analyzes the query execution to discover patterns that are missed by the query analysis phase.

3. QUERY EXECUTION PHASE

3.1 Data-Structures

3.1.1 Context Graph

This phase models the query execution as a context graph G. The context graph is built by adding nodes to it containing the RDF descriptions of the entities fetched from the web. A node is added as a child to the one containing the RDF description that led to it being fetched. The edge between the two nodes in G is labeled with the predicate name denoting the relationship between the two entities in them. The root is the seed which begins the execution of the query. The results occur at the leaves whose paths to the root match the corresponding sequence of predicate terms in the query. The context graph is also divided into different levels, similar to the query graph. The context graph contains bindings of the nodes in the query graph: if the subject binding of a triple pattern in the context graph belongs to level l, the object binding belongs to level l + 1. The contents of a node of the context graph are the description of an entity in the shape of a star graph; the center represents the entity being described and the edges denote its properties and their values. The context graph also acts as a cache. This allows us to check whether a resource's details have already been fetched before an attempt is made to retrieve it from the web, which further improves the query performance.

Figure 3: Context Graph (the context graph for the query of Figure 1, built over nm1:hasFriend and nm1:worksFor edges and ending in a univ:numberOfStudents edge with the value 10000; an inset shows the contents of the Manager node, with properties such as nm1:age 30, univ:degree Masters, nm1:designation Manager and nm1:worksFor)

Example. An example of a context graph is shown in Figure 3; it models the execution of the query in Figure 1. According to the query, Tim-Lee's URI is resolved to fetch his RDF description from the web, and this is stored in the corresponding node of the context graph. The context graph is then extended by creating nodes for Peter, Mary and John after fetching their details. This continues till the data for the last query pattern is retrieved. The result in this case is the node containing the value 10000, together with its path to the root.
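A minimal sketch of the context-graph data structure as described above; the field names are our own, and the cache is a simple URI-to-node dictionary.

class ContextNode:
    """One node of the context graph: the star-graph description of a single entity."""
    def __init__(self, uri, triples, level, parent=None, edge_predicate=None):
        self.uri = uri
        self.triples = triples                 # star graph: (subject, predicate, object) triples
        self.level = level                     # same level numbering as the query graph
        self.parent = parent
        self.edge_predicate = edge_predicate   # label of the edge from the parent
        self.children = []

class ContextGraph:
    def __init__(self, seed_uri, seed_triples):
        self.root = ContextNode(seed_uri, seed_triples, level=0)
        self.cache = {seed_uri: self.root}     # the context graph doubles as a cache

    def add_child(self, parent, uri, triples, predicate):
        """Add the description fetched for uri, reached from parent via predicate."""
        node = ContextNode(uri, triples, parent.level + 1, parent, predicate)
        parent.children.append(node)
        self.cache[uri] = node
        return node

    def lookup(self, uri):
        """Check the cache before dereferencing a URI again."""
        return self.cache.get(uri)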
3.1.2 Summaries

Summaries of context nodes are maintained and used by the heuristic to decide whether or not to explore the current node of the context graph. Two summaries are associated with each node of the query graph. One summary, S^qnode_results, summarizes the information of all the result-producing context nodes associated with qnode, and the other, S^qnode_noResults, summarizes the information of all the nodes that did not lead to results for qnode. A summary is basically a star graph whose center represents a generic result-producing or non-result-producing entity and whose edges denote the properties. Each edge has a weight indicating the number of times the triple containing that predicate and object has occurred in the nodes previously fetched. An edge with a high weight in the result-producing summary and a low weight in the non-result-producing summary can be used as a discriminant: if a node contains such an edge then it is highly probable that it will produce results, as it is similar to the earlier nodes that did produce results. At any point in time a summary contains only the top-K highest-weight triples.

Algorithm. When a URI is resolved to fetch an RDF description from the web, the description is inserted in the context graph as a new node N. Let's say N is associated with node qn of the query graph, which has the two summaries associated with it. If the execution continues for the new node and it produces results, its details are inserted in S^qn_results; otherwise they are inserted in S^qn_noResults. The insertion procedure is as follows. Each triple in the new node N is compared with the triples in one of the summaries, chosen according to whether N produced results. If both the predicate name and the object match, the weight of the corresponding edge is incremented by one. Otherwise, if the number of triples in the summary is less than K, the triple is inserted as a new triple and its weight is set to one. On the other hand, if the summary is already full, a triple with the least weight is randomly selected and replaced. This is carried out for all the nodes constituting the path from N to the leaf, updating the summaries at all levels with the corresponding details of the nodes.
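A sketch of this summary maintenance, with a summary kept as a dictionary from (predicate, object) pairs to weights and capped at the top-K entries; K's value and all names here are illustrative.

import random

K = 20   # illustrative cap on the number of triples kept per summary

def update_summary(summary, node_triples, k=K):
    """Insert the triples of a newly classified context node into a summary.

    summary maps (predicate, object) -> weight and represents the star graph
    of a generic result-producing (or non-result-producing) entity.
    """
    for _subject, predicate, obj in node_triples:
        key = (predicate, obj)
        if key in summary:
            summary[key] += 1                     # predicate and object both match
        elif len(summary) < k:
            summary[key] = 1                      # room left: insert as a new edge
        else:
            min_weight = min(summary.values())    # otherwise replace one least-weight edge
            victim = random.choice([e for e, w in summary.items() if w == min_weight])
            del summary[victim]
            summary[key] = 1
    return summary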
3.1.3 Links

The query graph may have a number of nodes which have relationships amongst them other than those specified by the query. These relationships are discovered by analyzing the execution of the query. Their discovery improves the query performance by introducing an edge denoting the relationship between the query nodes, bypassing the original edges and nodes when none of the bypassed nodes needs to be displayed. The data corresponding to the bypassed nodes then need not be fetched, thereby reducing the query execution time.

Example. Suppose we have a query which finds all the students who have authored a paper with a faculty member. In the query graph, the nodes denoting students and faculty have a node representing publications in between. The query is executed by fetching the details of the students, followed by their publication details, followed by the co-authors, in order to find a faculty member. During its execution we find that such a faculty member is generally an advisor of the student. Therefore, using this pattern, we can check whether the student has a publication and, if he does, display the advisor without fetching the details of the publications.

Summaries by themselves are unable to discover these relationships, which calls for additional data structures. Therefore two link structures are maintained between each pair of nodes, say n1 and n2, in the given query graph. The two types of link structures between a pair of nodes at level l are link^{n1-n2}_results and link^{n1-n2}_noResults. They are again star graphs, with the center indicating a generic result-producing or non-result-producing node and weighted predicates attached to it. They are maintained as follows. If a result for the query is produced, the predicate denoting the relationship between the bindings of n1 and n2 is inserted in the star graph link^{n1-n2}_results and its weight set to 1; if the predicate is already present, its weight is incremented by one. If a result for the query is not produced, link^{n1-n2}_noResults is updated in the same way. This process is carried out for all pairs of nodes in the query graph. Bypassing nodes may not be correct under certain circumstances, as it may lead to results which would not have been produced through the bypassed nodes. To deal with this situation, when a result is generated after bypassing we randomly check whether it would also have been generated by the un-bypassed original nodes. A variable is maintained which continuously computes the percentage difference in produced results between these two situations. If it falls below a certain threshold, we conclude that the bypassing of nodes is not correct and we drop it.
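The link structures can be kept in the same weighted-star-graph form. Below is a sketch of their maintenance and of the ratio test used later by the heuristic; the predicate name in the usage line and the structure of the tables are our own illustrative choices.

from collections import defaultdict

# link_results[(n1, n2)][predicate] and link_noresults[(n1, n2)][predicate] hold weights.
link_results = defaultdict(lambda: defaultdict(int))
link_noresults = defaultdict(lambda: defaultdict(int))

def record_link(n1, n2, predicate, produced_result):
    """Record the predicate directly relating the bindings of query nodes n1 and n2."""
    table = link_results if produced_result else link_noresults
    table[(n1, n2)][predicate] += 1

def bypass_score(n1, n2, predicate):
    """How strongly this predicate suggests a result-producing shortcut from n1 to n2."""
    good = link_results[(n1, n2)][predicate]
    bad = link_noresults[(n1, n2)][predicate]
    return good / bad if bad else (float("inf") if good else 0.0)

# Example use (illustrative predicate name):
record_link("?student", "?faculty", "univ:advisorOf", produced_result=True)
print(bypass_score("?student", "?faculty", "univ:advisorOf"))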
3.2 Heuristic

We can optimize the SPARQL query by halting the exploration of those nodes in the context graph which do not produce results. This requires the formulation of a heuristic which makes such a decision. This is similar to the branch and bound strategy, using a heuristic to prune nodes in the context graph. A reduced number of nodes corresponds to fewer URIs being dereferenced to fetch their RDF descriptions and thus less time taken to execute the query. The heuristic basically works by matching the contents of a node against the contents of the summaries and links associated with all the nodes of the query graph. The summaries are compared to discover edges which have a high weight in one of them but not the other; the presence of such edges in the node is used to assess its likelihood of producing results. Links are used to discover relationships between query nodes that are not specified in the query graph.

Example. For the query in Figure 1, suppose Tim-Lee has another friend, Richard, who is a professor in some university. When his RDF description is fetched, a node is created for him at level 1 of the context graph of Figure 3. In order to decide whether to continue exploring from this node, it is matched against the two summaries associated with the query node ?friend at the same level. Comparison of the two summaries indicates that result-producing nodes contain the triple defining the entity to be of type Professor, while non-result-producing nodes contain the triple defining the entity to be of type Manager. Richard's description is found to contain the triple describing him as a professor, and therefore it is predicted that his node will generate results.

Algorithm. The heuristic is invoked each time a new node N is inserted into the context graph. The contents of the new node are a description of an entity in the form of a star graph. Lines 2-11 of Algorithm 2 iterate over all the nodes in the query graph. The idea here is to find whether N contains a predicate which denotes a direct relationship between nodes in the query graph, allowing us to bypass the original nodes between them. The heuristic iterates over all the predicates in N in line 3, counts their matches in link^{n1-n2}_results and link^{n1-n2}_noResults, and stores the scores in the variables scoreLr and scoreLnR in lines 6 and 9 respectively. If the ratio of scoreLr to scoreLnR is greater than the threshold, we introduce a direct edge between the two nodes and exploration proceeds along this new edge in lines 10-11.

Lines 12-29 of Algorithm 2 iterate over all the triples in N. Each triple in N is compared with the triples in the two summaries associated with its corresponding query node n1; this happens in lines 14-24. Two score variables are maintained for each summary: scorePO counts the predicate-and-object matches, like "x advisorOf studentA", whereas scoreP counts the predicate matches alone, like "advisorOf", between the node's and the summary's contents. Lines 17-18 and 23-24 do the comparisons and increment the scores by the weights of the respective edges. The ratios of the scores computed from the result summary to those computed from the non-result summary are then determined. These ratios indicate the presence of edges which are predominant in S^{n1}_results but rare in S^{n1}_noResults. If either of the ratios is above a certain threshold in lines 25-26, we can say that N contains a pattern commonly found in the result-producing nodes of its level; hence, it is predicted to produce results and the execution proceeds from the new node. The inverses of these ratios are also compared with the threshold in lines 27-28; if they are greater, it means that N is very similar to the non-result-producing nodes and may also not produce results. The parameter δ represents the probability with which a node may be explored even after being rejected by the heuristic; when the node is found to be similar to the non-result-producing nodes, this probability is diminished by a constant e in line 28.

Algorithm 2: Heuristic
Input: Context graph and query graph, new node N
Output: Decision on whether to explore N
1   let N be the new node, associated with node n1 of the query graph
2   foreach node n2 in the query graph do
3       foreach predicate prd in N do
4           scoreLr <- 0 ; scoreLnR <- 0
5           foreach triple Tr ∈ link^{n1-n2}_results do
6               if Tr_predicate = prd then scoreLr <- scoreLr + Tr_weight
7           foreach triple Tnr ∈ link^{n1-n2}_noResults do
8               if Tnr_predicate = prd then
9                   scoreLnR <- scoreLnR + Tnr_weight
10          if scoreLr / scoreLnR >= RATIO then
11              explore N along the direct edge prd instead of the original path ; exit
12  foreach triple T in N do
13      scorePO_results <- 0 ; scoreP_results <- 0
14      foreach triple Tr ∈ S^{n1}_results do
15          if Tr_predicate = T_predicate then
16              if Tr_object = T_object then
17                  scorePO_results <- scorePO_results + Tr_weight
18              scoreP_results <- scoreP_results + Tr_weight
19      scorePO_noResults <- 0 ; scoreP_noResults <- 0
20      foreach triple Tnr ∈ S^{n1}_noResults do
21          if Tnr_predicate = T_predicate then
22              if Tnr_object = T_object then
23                  scorePO_noResults <- scorePO_noResults + Tnr_weight
24              scoreP_noResults <- scoreP_noResults + Tnr_weight
25  if scorePO_results / scorePO_noResults >= RATIO or scoreP_results / scoreP_noResults >= RATIO then
26      explore N ; exit
27  if scorePO_noResults / scorePO_results >= RATIO or scoreP_noResults / scoreP_results >= RATIO then
28      explore N with probability δ/e and exit
29  explore N with probability δ

During the initial phase of the execution, the decisions made for the nodes may not be correct. This happens because the number of nodes to compare against is small initially; if the current node is of a type that produces results, it is unlikely that a similar type of entity has been encountered earlier, and the heuristic in this case is unable to determine its result-producing capability. To overcome this shortcoming, a parameter δ is introduced which allows the exploration of a node even if the heuristic suggests otherwise. This is crucial during the initial part of the execution, when the heuristic fails to predict the usefulness of a node. In the latter part of the execution, the heuristic can bank on many nodes, which makes its decisions very reliable. This suggests that the parameter should vary continuously: it has a high value initially and decreases with each new node that leads to a result.
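A compressed Python sketch of the decision in Algorithm 2 for a single new node follows. It aggregates the scores over all triples of the node rather than per triple, uses the dictionary summaries of the earlier sketch, and the values of RATIO, δ and e are illustrative assumptions.

import random

RATIO = 2.0     # illustrative threshold
DELTA = 0.5     # probability of exploring a node the heuristic would reject (decays over time)
E = 4.0         # extra penalty when the node resembles non-result-producing nodes

def ratio(a, b):
    return a / b if b else (float("inf") if a else 0.0)

def should_explore(node_triples, s_results, s_noresults, delta=DELTA):
    """Decide whether to explore a new context node by comparing it with the two summaries.

    Summaries map (predicate, object) -> weight, as in the sketch of Section 3.1.2.
    """
    po_res = p_res = po_no = p_no = 0
    for _s, predicate, obj in node_triples:
        for (sp, so), w in s_results.items():
            if sp == predicate:
                p_res += w
                if so == obj:
                    po_res += w
        for (sp, so), w in s_noresults.items():
            if sp == predicate:
                p_no += w
                if so == obj:
                    po_no += w
    if ratio(po_res, po_no) >= RATIO or ratio(p_res, p_no) >= RATIO:
        return True                               # resembles earlier result-producing nodes
    if ratio(po_no, po_res) >= RATIO or ratio(p_no, p_res) >= RATIO:
        return random.random() < delta / E        # resembles non-result-producing nodes
    return random.random() < delta                # undecided: explore with probability delta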
4. EXPERIMENTS

The experiments were conducted on a Pentium 4 machine running Windows XP with 1 GB of main memory. All the programs were written in Java. The synthetic data used for the simulations was generated with the LUBM benchmark data generator [4]. The LUBM benchmark is basically a university ontology that describes all the constituents of a university, such as its faculty, courses, students etc. The synthetic data is represented as a web of linked data with 200,890 nodes denoting entities and 500,595 edges denoting the relationships between them. The efficacy of the proposed idea was demonstrated by executing a set of complex queries on the simulated web of linked data of a university and comparing the results with the existing approach. There are a number of patterns in the data, and each of the queries below demonstrates the ability of the proposed approach to discover and use them to optimize its execution. The results are judged according to the two metrics discussed next.

Metrics. The time taken to execute a query is proportional to the number of URIs resolved to fetch their RDF descriptions during the course of query execution; the time taken to decide whether to explore a node or not is dominated by the time taken to retrieve the data from the web. Therefore, minimizing the number of URIs resolved has the bigger impact on the query execution time. Thus the results are judged according to two metrics. The first metric, α, represents the percentage reduction in the number of URIs dereferenced compared with the existing approach, indicating the degree of optimization achieved. The second metric, β, denotes the percentage reduction in the completeness of results compared with the existing approach.

Let Ne be the number of URIs dereferenced by the existing approach, Re the number of results generated by the existing approach, Np the number of URIs dereferenced by the proposed approach, and Rp the number of results generated by the proposed approach. Then

α = ((Ne − Np) / Ne) × 100 ;  β = ((Re − Rp) / Re) × 100

Table 1: Results

Query    Ne        Re       Np       Rp       α      β
Q1       135631    13911    42463    13076    69%    6%
Q2       71714     532      21362    532      71%    0%
Q3       145788    14316    33321    13170    77%    8%
Q4       72044     446      29437    446      59%    0%

Figure 4: Degree of Optimization achieved
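The two metrics can be computed directly from the counts in Table 1; for example:

def alpha(n_existing, n_proposed):
    """Percentage reduction in the number of URIs dereferenced."""
    return (n_existing - n_proposed) / n_existing * 100

def beta(r_existing, r_proposed):
    """Percentage reduction in the completeness of results."""
    return (r_existing - r_proposed) / r_existing * 100

# Q1 from Table 1: Ne = 135631, Np = 42463, Re = 13911, Rp = 13076
print(round(alpha(135631, 42463)), round(beta(13911, 13076)))   # roughly 69 and 6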
Queries.

Query 1. Return the students in a department who take a graduate course.
The query execution with the existing approach begins by fetching the details of all the students in the department, followed by the details of the courses taken by them. The courses are then filtered according to whether or not they are of type graduate course. The proposed approach optimizes the query by recognizing that most of the graduate courses are taken by students who have an advisor. This is used to avoid retrieving the course details of students who do not have an advisor.

Query 2. Return the students whose advisors have an interest in the research area "Research9".
The existing approach executes the query by fetching the details of all the students, followed by the details of their advisors. The advisors' research interests are further retrieved and checked to see whether the title is "Research9". The proposed approach optimizes the query by building a list of the faculty involved in "Research9". Then, as the students' details are fetched, they are checked to see whether their advisors belong to this list. This avoids the retrieval of data regarding advisors, and their research interests, not involved in "Research9".

Query 3. Return the students who co-authored a paper with a faculty member.
The query is executed by the existing approach by first fetching the details of all the students, followed by the details of their publications. The publication details reveal the co-authors, whose details are again fetched. If any one of the co-authors is a faculty member, his URI is displayed as a result along with the student's. The proposed approach discovers that most such faculty members are in fact advisors of the students. Therefore, as a student's details are fetched, they are parsed to see whether he has authored a paper; if he has, his advisor is displayed as the co-author. This optimization avoids the retrieval of data pertaining to publications and their co-authors.

Query 4. Return the faculty who teach courses which are attended by the students they advise.
The query is executed by the existing approach by first fetching the details of all the faculty, followed by the details of the courses they teach. This is followed by fetching the details of all the students who take these courses. The students' details are then checked to see whether they are advised by the professor who teaches the course. The proposed approach discovers the fact that lecturers teach courses but do not advise students. This fact is used to optimize the query by not fetching the details of lecturers' courses and of the students who take those courses.

5. CONCLUSIONS AND FUTURE WORK

In this paper we present an approach to optimize SPARQL query execution over the web of linked data. The approach is a two-phase strategy to discover patterns in the query. These patterns are used to identify data that does not contribute to answering the query and can be prevented from being fetched, resulting in a reduction in the query execution time. We ran simulations that demonstrated the efficacy of the proposed approach. Future work includes the application of the proposed approach to optimize conventional SPARQL queries over RDF data.

6. REFERENCES

[1] C. Bizer, T. Heath, and T. Berners-Lee. Linked data – the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
[2] A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, 2007.
[3] M. Franklin. From databases to dataspaces: A new abstraction for information management. SIGMOD Record, 34:27–33, 2005.
[4] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
[5] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 411–420, New York, NY, USA, 2010. ACM.
[6] O. Hartig, C. Bizer, and J.-C. Freytag. Executing SPARQL queries over the web of linked data. In 8th International Semantic Web Conference (ISWC 2009), October 2009.
[7] Z. G. Ives and N. E. Taylor. Sideways information passing for push-style query processing. In ICDE, pages 774–783. IEEE, 2008.
[8] E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. W3C Recommendation, World Wide Web Consortium, 2008.