=Paper=
{{Paper
|id=Vol-1261/SSWS2014_paper2
|storemode=property
|title=Scheduling for SPARQL Endpoints
|pdfUrl=https://ceur-ws.org/Vol-1261/SSWS2014_paper2.pdf
|volume=Vol-1261
|dblpUrl=https://dblp.org/rec/conf/semweb/MaaliHD14
}}
==Scheduling for SPARQL Endpoints==
Fadi Maali, Islam A. Hassan, and Stefan Decker

Insight Centre for Data Analytics, National University of Ireland Galway

{firstname.lastname}@insight-centre.org

==Abstract==

When providing public access to data on the Semantic Web, publishers have various options that include downloadable dumps, Web APIs, and SPARQL endpoints. Each of these methods is most suitable for particular scenarios. SPARQL provides the richest access capabilities and is the most suitable option when granular access to the data is needed. However, SPARQL expressivity comes at the expense of high evaluation cost. The potentially large variance in the cost of different SPARQL queries makes guaranteeing consistently good quality of service a very difficult task. Current practices to enhance the reliability of SPARQL endpoints, such as query timeouts and limiting the number of results returned, are far from ideal. They can result in under-utilisation of resources by rejecting some queries even when the available resources are sitting idle, and they neither isolate “well-behaved” users from “ill-behaved” ones nor ensure fair sharing among different users. In similar scenarios, where unpredictable contention for resources exists, scheduling algorithms have proven to be effective and to significantly enhance the allocation of resources. To the best of our knowledge, using scheduling algorithms to organise query execution at SPARQL endpoints has not been studied. In this paper, we study, and evaluate through simulation, the applicability of a few algorithms to scheduling queries received at a SPARQL endpoint.

==1 Introduction==

When providing public access to data on the Semantic Web, publishers have various options that include downloadable dumps, Web APIs, and SPARQL endpoints. Each of these methods is most suitable for particular scenarios; however, none of them provides an ideal global solution [7, 13].
SPARQL, the W3C-recommended query language (http://www.w3.org/TR/sparql11-query/), is an attractive option to provide expressive access to RDF data. SPARQL is basically a graph pattern matching language that provides rich capabilities for slicing and dicing RDF data. The latest version, SPARQL 1.1, added support for aggregation, nested and distributed queries, and other features.

However, supporting public SPARQL access to data is expensive. It has been shown that evaluating SPARQL is PSPACE-complete in general and coNP-complete for well-designed queries [10]. Therefore, the cost of different SPARQL queries can vary a lot, making it very difficult to guarantee consistently good quality of service. Evidence of this can be seen on the SPARQL Endpoint Status web page (http://sparqles.okfn.org/), in the literature [2], and across the Web (see for example http://answers.semanticweb.com/questions/14440/can-the-economic-problem-of-shared-sparql-endpoints-be-solved) and the blogosphere (e.g. http://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql-endpoint/ and http://ruben.verborgh.org/blog/2013/09/30/can-i-sparql-your-endpoint/).

Existing SPARQL endpoints employ different measures to enhance their reliability and to ensure consistent quality of service. Such measures include query timeouts, refusing expensive SPARQL queries, limiting the number of triples returned, or returning partial results. For example, 4Store supports a soft limit for execution time (http://4store.org/trac/wiki/SparqlServer) and Virtuoso allows setting a maximum threshold on the expected query cost (http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSPARQLEndpointProtection). There are still a number of problems with these approaches: (i) they provide an inconsistent user experience and limit the expressiveness of allowed queries; (ii) there is no clear way to communicate these non-standard shortcomings to the user; (iii) they can result in under-utilisation of resources by rejecting some queries even when the available resources are sitting idle; (iv) they do not isolate “well-behaved” users from “ill-behaved” ones and do not ensure fair sharing among different users.

In similar scenarios, where unpredictable contention for resources exists, scheduling algorithms have proven to be effective and to significantly enhance the allocation of resources. Scheduling has been utilised for data networks [3, 12], for process assignment in operating systems [8], for cloud and grid computing [9, 4], and recently to schedule jobs sent to a Hadoop cluster [15, 14]. Nevertheless, to the best of our knowledge, it has not been studied in the context of SPARQL endpoints.

In this paper we argue for employing scheduling algorithms to organise query execution at a, possibly public, SPARQL endpoint. Given the wide applicability of scheduling, there exists a large number of scheduling algorithms. In this paper, we study, and evaluate through a simulation, the applicability of a few algorithms to scheduling queries received at a SPARQL endpoint. A number of scheduling algorithms, mainly from the data networks domain, are reviewed (Section 3.1). We then describe their applicability to scheduling SPARQL queries (Section 3.2) and study through a simulation their effect on two popular SPARQL engines (Section 4).

We do not claim that scheduling solves the problem of providing a reliable publicly-accessible SPARQL endpoint. Nevertheless, our results show that scheduling enhances throughput and reduces the effect of complex queries on simpler ones. We also note that scheduling has the extra advantages of rewarding socially-aware behaviour and achieving better utilisation and fairer allocation of the available resources.

Fig. 1: General model of scheduling

==2 Related Work==

Most current practices to enhance the reliability of SPARQL endpoints are based on introducing ad-hoc measures and limits such as query timeouts, refusing expensive SPARQL queries, limiting the number of triples returned, or returning partial results. These measures have a number of problems, as discussed in the introduction. Linked Data Fragments [13] is a recent proposal to enhance the reliability of SPARQL endpoints by shifting part of the computation needed to answer SPARQL queries towards the client side. Furthermore, [1] proposed a vision for a workload-aware and adaptive system to deal with the diversity and dynamism that are inherent in SPARQL workloads.

To the best of our knowledge, scheduling has not been studied in the context of SPARQL endpoints. Nevertheless, scheduling has been used and studied in many domains. Scheduling has been extensively studied in data networks, where the capacity of a switch is shared by multiple senders [3, 6, 12]. We describe the main algorithms used in data networks in the next section.

Scheduling has also been used to organise jobs on shared Hadoop clusters. The First In First Out (FIFO) Scheduler, Fair Scheduler (http://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html) and Capacity Scheduler (http://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html) are the most widely used schedulers in practice. Similar to our work, these schedulers re-use algorithms defined for data networks. Notice that, in contrast to SPARQL queries, Hadoop jobs are expected to take a long time and their execution can be pre-empted. Moreover, scheduling has also been applied in wireless networks [5], grid computing [4] and distributed hash tables [11].

==3 Scheduling==

Figure 1 depicts a generic model for scheduling. There is a single server, and N job arrival streams, each feeding a different first in, first out (FIFO) queue. The scheduler decides which stream gets served at each moment and for how long. The goal is usually to maximise throughput while ensuring fair sharing of resources and avoiding long waiting times. Keeping a separate queue for each stream puts up “firewalls” protecting well-behaved streams against streams that might otherwise saturate the server capacity and drive delays to unacceptable levels.

===3.1 Scheduling Algorithms===

We next describe a number of scheduling algorithms used mainly for data networks and then discuss their applicability to SPARQL endpoints.

'''Ideal Scheduler.''' Processor Sharing (PS) is a mathematical idealization of queuing, originally proposed as an idealization of time-slicing in computer systems (Kleinrock 1976, as cited in [6]). In PS, the server cycles through the active jobs, giving each a small time quantum of service, and then preempting the job to work on the next. The PS mechanism allots resources fairly and provides full utilisation of the resources. However, in settings where pre-empting jobs is not feasible, such as data networks, PS cannot be applied. A number of “emulations” of it exist nevertheless. We next discuss two of them.

'''Fair Scheduler.''' Described in [3] and [6]. Fair scheduling emulates PS by maintaining a notion of virtual time (the time at which an input stream would have been served had PS been applied). Streams are then served in increasing order of their virtual finish times. It has been shown that this algorithm emulates the fair PS algorithm well [6].

'''Deficit Round-Robin Scheduler.''' Described in [12]. Each input stream holds a deficit counter (a credit balance) and only gets served if the query cost is less than its balance. After execution, the cost is subtracted from the balance. If a queue is not served due to insufficient balance, it gets a quantum charge that can be used in the next round, i.e. a queue is compensated when its service is delayed.
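The deficit round-robin mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name, the quantum value, and the example stream names and query costs are all ours.

```python
from collections import deque

class DeficitRoundRobin:
    """Sketch of deficit round-robin over per-stream FIFO queues.

    Each job is a (name, cost) pair; `quantum` is the credit added to a
    backlogged queue on every round. Names and costs are illustrative."""

    def __init__(self, quantum):
        self.quantum = quantum
        self.queues = {}    # stream id -> deque of (name, cost)
        self.deficit = {}   # stream id -> accumulated credit balance

    def enqueue(self, stream, job):
        self.queues.setdefault(stream, deque()).append(job)
        self.deficit.setdefault(stream, 0)

    def next_round(self):
        """Serve each backlogged stream up to its balance; return served jobs."""
        served = []
        for stream, q in self.queues.items():
            if not q:
                continue
            self.deficit[stream] += self.quantum   # quantum charge per round
            while q and q[0][1] <= self.deficit[stream]:
                name, cost = q.popleft()
                self.deficit[stream] -= cost       # cost subtracted after service
                served.append(name)
            if not q:
                self.deficit[stream] = 0           # idle streams keep no credit

        return served
```

With a quantum of 100 and the paper's proxy costs (90 for simple, 450 for complex), a simple query is served in the first round while a complex query banks credit for several rounds before it runs, which is exactly the compensation behaviour described above.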
===3.2 Scheduling for SPARQL===

We consider scheduling as a service running on top of SPARQL endpoints. The scheduler receives the queries and then decides in what order to send them to the endpoint. The goal is to: (i) minimize the effect that expensive queries can have on other queries; (ii) maximize the utilization of the available resources; (iii) reward socially-aware behaviour (e.g., simpler queries that set a limit on the number of required results).

However, contrary to the typical scheduling scenario depicted in Figure 1, a SPARQL endpoint can handle multiple queries at a time. In fact, as triple stores, Web servers and the underlying operating systems each have their own resource utilisation and sharing facilities, it would be very inefficient to send only one query at a time to the endpoint. Therefore, the SPARQL scheduler sends multiple queries to the endpoint as long as their total cost is under a configured threshold. Upon the completion of processing a query, its cost is subtracted from the tracked total cost. The threshold is necessary to avoid overwhelming the endpoint, as sending many queries simultaneously results in the endpoint refusing to process many of them.

In practice, three further challenges need to be addressed:
– Efficiently estimating the cost of an incoming SPARQL query. The cost involves multiple parameters such as CPU, memory, network and I/O cost. The cost also depends on the query, the data and the triple store. However, this problem is beyond the scope of this paper.
– Identifying the source of each query in order to define streams of inputs. In the absence of user authentication, IP addresses and session detection techniques can be used.
– Setting the threshold of total computing capacity. This can be set via experimenting.

==4 Simulation Experiment==

===4.1 Implementation===

We implemented three scheduling algorithms in Java (the code is available at https://gitlab.insight-centre.org/Maali/sparql-endpoints-scheduler):
– FIFO: A First-In-First-Out queue. This scheduler serves input streams in order (one query at a time) while keeping track of the total cost of queries being processed at every point and ensuring that this cost is always kept lower than the computing capacity threshold.
– Deficit: Implements a deficit round-robin algorithm as described in Section 3.1.
– Fair: Implements a fair scheduling algorithm as described in Section 3.1.

===4.2 Experiment Setup===

We experimented with two triple stores, Virtuoso Open-Source Edition 7.0.0 (http://virtuoso.openlinksw.com/) and Jena Fuseki 1.0.1 (http://jena.apache.org/documentation/serving_data/). Both triple stores were run on a Mac with 8GB memory and a 2.9GHz Intel Core i7 CPU. Jena Fuseki was run in memory with a maximum heap size of 3GB.

We used the Semantic Web Dog Food (http://data.semanticweb.org/) data that contains information about papers published in the main conferences and workshops in the area of Semantic Web research. The data was downloaded (http://data.semanticweb.org/dumps/) from the Semantic Web Dog Food website in February 2014. The Web logs of the Semantic Web Dog Food were made available as part of the USEWOD 2013 Data Challenge (http://data.semanticweb.org/usewod/2013/challenge.html). To get real-world SPARQL queries for our experiment, we extracted SPARQL queries from these Web logs. We ran each query three times on Fuseki (without any added scheduling) and classified it as simple or complex based on the time it took. All simple queries took an average of less than 90ms, while complex queries took more than 450ms. We use these numbers (90 and 450) as a proxy for the query costs.

We simulated two concurrent flows of queries, one with simple queries and the other with complex queries. This simulates two applications with different needs. Queries were selected randomly from the sets of simple and complex queries respectively. Both flows follow a Poisson distribution.
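A minimal event-driven sketch of this setup follows. The proxy costs (90, 450) and the capacity threshold (8000) come from the paper; everything else is our assumption, in particular the simplification that a query's service time equals its cost in milliseconds, and the FIFO admission policy used here.

```python
import heapq
import random

COST = {"simple": 90, "complex": 450}  # proxy costs from the paper

def simulate(interval_ms, threshold=8000, duration_ms=5000, seed=7):
    """Sketch: two Poisson query flows queued FIFO and admitted to the
    endpoint while the total in-flight cost stays under `threshold`.
    Service time = cost in ms is a simplifying assumption.
    Returns the number of fully processed queries per flow."""
    rng = random.Random(seed)
    events, seq = [], 0            # (time, seq, kind, flow) min-heap
    for flow in COST:              # one Poisson arrival stream per flow
        t = 0.0
        while True:
            t += rng.expovariate(1.0 / interval_ms)  # mean = interval_ms
            if t >= duration_ms:
                break
            heapq.heappush(events, (t, seq, "arrive", flow)); seq += 1

    waiting, in_flight = [], 0
    done = {"simple": 0, "complex": 0}
    while events:
        t, _, kind, flow = heapq.heappop(events)
        if kind == "arrive":
            waiting.append(flow)
        else:                      # a completion frees capacity
            in_flight -= COST[flow]
            done[flow] += 1
        # admit queued queries FIFO while the threshold allows
        while waiting and in_flight + COST[waiting[0]] <= threshold:
            nxt = waiting.pop(0)
            in_flight += COST[nxt]
            heapq.heappush(events, (t + COST[nxt], seq, "finish", nxt)); seq += 1
    return done
```

Running `simulate(100)` versus `simulate(1000)` reproduces the qualitative point made below: at a 1000 ms interval arrivals are rare enough that queries are admitted immediately, while at 100 ms the threshold forces queries to wait.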
We experimented with different means of the Poisson distribution (a.k.a. the interval) and with different thresholds for the computing capacity (i.e., the total cost of queries being processed at a time). Each run was stopped after 5 minutes and logs were collected then. We consider a query to be fully processed if it is sent to the endpoint and all results sent back from the endpoint are received (within the 5 minute cut-off). For each query we measure two durations:

Waiting Time: the time taken from the moment the query is received at the scheduler until it is sent to the endpoint.

Processing Time: the time taken from the moment the query is received at the scheduler until the moment at which the query is fully processed.

We wanted to include a no-scheduling setting as a baseline; however, running these flows without any scheduling resulted in overwhelming the endpoint, and therefore most of the queries were dropped without getting answered.

===4.3 Results & Discussion===

We experimented with intervals of 100, 200, 500 and 1000 milliseconds (ms). Queries sent at intervals of 500 and 1000 ms were not frequent enough to require any waiting and therefore resulted in no scheduling. We report only on queries with intervals of 100 and 200 ms. We experimented with different thresholds for the computing capacity. We report here only on the results when the threshold is set to 8000 because it showed the most effective results for the three scheduling algorithms tested; all larger values overwhelmed the endpoint. Notice that 8000 is just a proxy for the available computation capacity and it needs to be interpreted together with the cost of the queries (we used 90 and 450 for simple and complex queries respectively). Table 1 shows the percentage of queries completed when sent at a 100 ms interval on Fuseki.
Fig. 2: Virtuoso - Processing and waiting times ((a) processing time, simple queries; (b) processing time, complex queries; (c) waiting time, simple queries; (d) waiting time, complex queries; x-axis: interval in ms; series: FIFO, Deficit, Fair)

On Virtuoso all queries were fully processed, while on Fuseki it can be noticed that the FIFO algorithm processed equivalent percentages of both simple and complex queries, while the other two algorithms “favoured” simple queries and “penalised” complex ones. Deficit scheduling showed better throughput. Figures 2 and 3 show the processing and waiting times for simple and complex queries when running against Virtuoso and Fuseki. Similar to the throughput results, it can be noticed how penalising complex queries results in smaller waiting and processing times for simple queries when using deficit or fair scheduling. On the other hand, complex queries have smaller waiting and processing times using FIFO scheduling with a 200 millisecond interval. The higher processing time of complex queries using FIFO scheduling was surprising. Our interpretation of this is that prioritising simple queries allowed better utilisation of the available resources when queries arrive at a high frequency that can overwhelm the endpoint.

Fig. 3: Fuseki - Processing and waiting times ((a) processing time, simple queries; (b) processing time, complex queries; (c) waiting time, simple queries; (d) waiting time, complex queries; x-axis: interval in ms; series: FIFO, Deficit, Fair)

In summary, scheduling allowed getting better throughput by delaying queries instead of rejecting them (recall that running without scheduling resulted in rejecting most of the queries). Deficit and Fair scheduling algorithms favoured simpler queries. In general, deficit scheduling, the simpler algorithm, showed better results in our settings than fair scheduling.

{|
! !! Simple Queries !! Complex Queries
|-
| FIFO || 47.3% || 47.4%
|-
| Deficit || 100% || 42%
|-
| Fair || 77% || 40%
|}
Table 1: Percentage of fully processed queries on Fuseki (throughput)

==5 Conclusions & Future Work==

We reported a simulation experiment to study the effects different scheduling algorithms can have when used to organise the execution of SPARQL queries received at an endpoint. We note that simple scheduling can be implemented with minimal overhead and has an effect only when queries are received at high frequency. We consider this work an initial step and hope to extend the experiment and deploy it in a real-world use case.

Acknowledgements.
Fadi Maali is funded by the Irish Research Council, Embark Postgraduate Scholarship Scheme. This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

==References==

1. G. Aluç, M. T. Özsu, and K. Daudjee. Workload Matters: Why RDF Databases Need a New Design. PVLDB, 7(10):837–840, 2014.
2. C. Buil-Aranda, A. Hogan, J. Umbrich, and P.-Y. Vandenbussche. SPARQL Web-Querying Infrastructure: Ready for Action? In ISWC 2013, pages 277–293. Springer, 2013.
3. A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In ACM SIGCOMM Computer Communication Review, volume 19, pages 1–12. ACM, 1989.
4. F. Dong and S. G. Akl. Scheduling Algorithms for Grid Computing: State of the Art and Open Problems. School of Computing, Queen's University, Kingston, Ontario, 2006.
5. H. Fattah and C. Leung. An Overview of Scheduling Algorithms in Wireless Multimedia Networks. IEEE Wireless Communications, 9(5):76–83, 2002.
6. A. G. Greenberg and N. Madras. How Fair is Fair Queuing. Journal of the ACM (JACM), 39(3):568–598, 1992.
7. A. Hogan and C. Gutierrez. Paths towards the Sustainable Consumption of Semantic Data on the Web. In AMW, 2014.
8. L. Kleinrock. Queueing Systems, Volume II: Computer Applications. Wiley, 1976.
9. H. Kllapi, E. Sitaridi, M. M. Tsangaris, and Y. Ioannidis. Schedule Optimization for Data Processing Flows on the Cloud. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 289–300. ACM, 2011.
10. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL. ACM Transactions on Database Systems (TODS), 34(3):16, 2009.
11. S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu. OpenDHT: A Public DHT Service and its Uses. In ACM SIGCOMM Computer Communication Review, volume 35, pages 73–84. ACM, 2005.
12. M. Shreedhar and G. Varghese. Efficient Fair Queuing Using Deficit Round-Robin. IEEE/ACM Transactions on Networking, 4(3):375–385, 1996.
13. R. Verborgh, M. Vander Sande, P. Colpaert, S. Coppens, E. Mannens, and R. Van de Walle. Web-Scale Querying through Linked Data Fragments. In Proceedings of the 7th Workshop on Linked Data on the Web, 2014.
14. M. Yong, N. Garegrat, and S. Mohan. Towards a Resource Aware Scheduler in Hadoop. In Proc. ICWS, pages 102–109, 2009.
15. M. Zaharia. Job Scheduling with the Fair and Capacity Schedulers. Hadoop Summit, 9, 2009.