=Paper=
{{Paper
|id=Vol-1733/paper-14
|storemode=property
|title=Studying Metadata for better client-server trade-offs in Linked Data publishing
|pdfUrl=https://ceur-ws.org/Vol-1733/paper-14.pdf
|volume=Vol-1733
|authors=Miel Vander Sande
|dblpUrl=https://dblp.org/rec/conf/semweb/Sande16
}}
==Studying Metadata for better client-server trade-offs in Linked Data publishing==
Miel Vander Sande, Ghent University – iMinds, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium, miel.vandersande@ugent.be

===1 Problem statement===
Since the introduction of the Semantic Web, querying Linked Data has mostly relied on two types of interfaces: Linked Data documents and the SPARQL protocol. However, neither covers the wide spectrum of possible use cases and their specific requirements. Not only is the number of public SPARQL endpoints limited, they also suffer from frequent downtime [6, 15]. Predicting the consumption of computational resources of an endpoint is hard because of SPARQL’s expressiveness and individual user demand. Linked Data documents are more predictable, but querying based on traversing links is significantly slower and yields less complete results. Unfortunately, both are highly undesirable for reliable user applications.

The issues above hint at a need for other client/server trade-offs. Such trade-offs can be analyzed using Linked Data Fragments (LDF) [17], which proposes a uniform view on all interfaces to RDF. A Linked Data Fragment is characterized by a specific selector (e.g., subject URI, SPARQL query), metadata (e.g., variable names, counts), and controls (e.g., links or URIs to other fragments). This reveals a complete spectrum between Linked Data documents and the SPARQL protocol, in which we can advance the state of the art of Linked Data publishing. This spectrum can be explored in the following two dimensions: i) selector, allowing different, more complex questions for the server; and ii) metadata, extending the response with more information clients can use.

This work studies the second dimension, metadata, in a practical Web context. Considering the conditions on the Web, this problem becomes three-fold. First, analogous to the Web itself, LDF interfaces should exist in a distributed, scalable manner in order to succeed.
Generating additional metadata introduces overhead on the server, which influences the ability to scale towards multiple clients. Second, the communication between client and server uses the HTTP protocol. Modeling, serialization, and compression determine the extra load on the overall network traffic. Third, with query execution on the client, novel approaches need to apply this metadata intelligently to increase efficiency.

Concretely, this work defines and evaluates a series of transparent, interchangeable, and discoverable interface features. We proposed Triple Pattern Fragments (TPF) [17], a Linked Data interface with low server cost, as a fundamental base. This interface uses a single triple pattern as selector. To explore this research space, we extend this interface with different metadata, starting with an estimated total number of matching triples. By combining several TPFs, SPARQL queries are evaluated on the client side, using the metadata for optimization. Hence, we can study the impact of metadata on query execution time, bandwidth overhead, caching effectiveness, and server overhead.

===2 Relevancy===
The problem described in the previous section is relevant for both Linked Data consumption and publishing. Our approach specifically aims at introducing new client-server trade-offs. Thereby, it directly increases the granularity of engagement Linked Data publishers can take [13]. This ability to balance cost and utility lowers the threshold for publishing queryable Linked Data, ultimately leading to more available and easily consumable datasets. In turn, this drastically increases the reliability and strength of Linked Data-based client applications [12]. Current infrastructure, such as SPARQL endpoints or data dumps, has proven insufficient to drive major adoption in application development.
This work facilitates reliable and dynamic data services in various domains, including eCommerce and the public sector.

===3 Related work===
Many related works can be found in distributed databases [3] and hybrid query shipping [5, 7], which are already very mature areas. However, these works assume a local dedicated network. Works that apply these techniques in a Web context and to Linked Data, which have different constraints, are still limited. The works that do use metadata for SPARQL query optimization either apply a centralized approach [11] or do not measure the process of metadata extraction and shipping (e.g., federated query systems) [1, 8, 14]. This is because most research considers the SPARQL specification a given.

Some work has been done on more specific types of metadata. Highly relevant is the proposal to extend the ASK query response [9] with a Bloom filter representing combinations of bindings, i.e., two variables in a triple pattern, to improve source selection in SPARQL query federation frameworks. However, the benefit in a single-server setup is unclear.

===4 Research question(s)===
This work seeks an answer to the following main research question: How do different types of fragment metadata affect the relation between interface cost and utility with regard to client-side query execution? In this respect, we also formulate the following subquestions for a series of selected types:

– Can all selected metadata be modeled in RDF so it can be reconstructed on the client?
– How does out-of-band delivery of metadata, i.e., included in a separate HTTP resource, compare to in-band delivery in terms of query execution time?
– What is the added server memory and CPU cost of generating such metadata?
– How does the type of metadata impact the shipping cost between server and client in a Web context?
– Can metadata decrease federated query execution time over multiple Linked Data sources?
– Can hypermedia to other relevant interfaces increase recall for federated queries?
– Can such metadata decrease the number of HTTP requests used by client-side query execution?

===5 Hypotheses===
With respect to the stated research questions, we formulate the following hypotheses:

– A client can reconstruct metadata described in a formalized vocabulary.
– Out-of-band delivery of metadata decreases query execution time compared to in-band delivery.
– Generating metadata introduces an insignificant server cost compared to the total server cost.
– The metadata introduces a significant increase in shipping cost.
– Hypermedia can dynamically increase recall for queries federated over a Web of Data.
– Metadata significantly reduces the number of HTTP requests required by clients to answer a query.

===6 Preliminary results===
In previous years, we conducted research on cardinality [18] and Approximate Membership metadata [16].

The results of the cardinality experiments indicate that, at the cost of increased query times, executing queries over TPF reduces server usage. TPF servers cope better with increasing numbers of clients than SPARQL endpoints. They have a generally low and regular CPU load, accompanied by less variation in response time. Furthermore, querying benefits strongly from regular HTTP caching, which can be added at any point in the network. These three facts validate that the interface reduces the server-side cost to publish knowledge graphs. This is all the more remarkable since, to allow comparisons with other work, these results were obtained with an existing SPARQL benchmark that focuses on performance, not server cost.

A second experiment validates that this behavior extends to real-world knowledge graphs such as DBpedia. A vast majority of queries stay well below the 1-second limit, despite being affected by the knowledge graph size. We note a strong influence of the type of query, especially when non-BGP SPARQL constructs are involved.
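
To make the role of cardinality metadata more concrete, the following minimal Python sketch shows one way a client could use estimated match counts to decide which triple pattern to resolve first, and then substitute the resulting bindings into the remaining patterns. It is an illustration under assumptions, not the algorithm of the TPF client: the names `pick_cheapest_pattern`, `bind`, and the `estimate_count` callback (which stands in for reading the count from a fragment's metadata) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A triple pattern as (subject, predicate, object); terms starting with "?" are variables.
TriplePattern = Tuple[str, str, str]


def pick_cheapest_pattern(
    patterns: List[TriplePattern],
    estimate_count: Callable[[TriplePattern], int],
) -> TriplePattern:
    """Return the pattern with the fewest estimated matches.

    `estimate_count` is a hypothetical stand-in for fetching a fragment
    and reading its cardinality metadata.
    """
    return min(patterns, key=estimate_count)


def bind(pattern: TriplePattern, bindings: Dict[str, str]) -> TriplePattern:
    """Replace variables that already have a value; leave other terms untouched."""
    s, p, o = (bindings.get(term, term) for term in pattern)
    return (s, p, o)
```

A client following this sketch would start with the most selective pattern, so that each binding it obtains narrows the remaining patterns before further fragments are requested.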
A third experiment shows that, although more compact formats show a decrease in query execution time, these findings no longer apply when responses are compressed by gzip, which is commonly used within the HTTP protocol. Also, the serialization and deserialization costs can be decisive, especially if they involve relatively few triples, which is the case for typical page sizes (e.g., 100) of a TPF interface. The experiment shows the importance of carefully considering serializations. Even though removing or shortening metadata and control triples would work for specialized TPF clients, the general applicability would be narrowed.

In terms of Approximate Membership metadata, we augmented the TPF interface with both Bloom filters and Golomb-coded sets, which are two types of Approximate Membership Filters (AMFs). We aimed at reducing HTTP requests by avoiding expensive triple membership checks, since for one third of a set of diverse query types, most of the request overhead consists of membership subqueries. At the expense of one extra request to fetch the approximate membership metadata, potentially many more could be saved. Indeed, the experimental results confirm a drastic decrease in requests for half of the 250 randomly generated WatDiv [2] queries, while others experience little overhead thanks to local caching. Furthermore, this addition does not affect the low-cost nature of the server, which only has a limited load increase. However, there is a computational overhead on the client for queries that are not improved. An intelligent client should minimize this by deciding when to use membership metadata based on the query type. Despite the reduction of requests, the total execution time is higher on average because of long delays introduced to generate AMFs. Hence, we conclude that this metadata is not suitable for real-time computation and recommend pre-computing or pre-caching it.
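
The membership check that such a filter enables can be illustrated with a small Bloom filter. The sketch below is a generic textbook construction in Python, not the encoding used in the experiments: the sizing formulas are the standard ones, while the SHA-256 double hashing, the `triple_key` serialization, and all names are assumptions made for illustration. Golomb-coded sets are not shown.

```python
import hashlib
import math
from typing import Iterator, Tuple

Triple = Tuple[str, str, str]


class BloomFilter:
    """Approximate set: 'no' is always correct, 'yes' may be a false positive."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing (an illustrative choice): k positions from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def triple_key(triple: Triple) -> str:
    """Hypothetical serialization of a triple into a single filter key."""
    return " ".join(triple)


def may_contain(amf: BloomFilter, triple: Triple) -> bool:
    """False means the triple is certainly absent, so no HTTP request is needed."""
    return triple_key(triple) in amf
```

With a filter like this on the client, a negative answer replaces a membership request entirely, whereas a positive answer still has to be confirmed with the server because it may be a false positive.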
A strong benefit of HTTP caching has been proven for TPF querying [17] due to the limited possible number of requests, and this mechanism can be applied efficiently to TPFs with augmented metadata. While Bloom filters are preferred for lower computation time, the smaller size of Golomb-coded sets would prevail in the presence of caching. To prevent the overhead of generating and transferring AMFs, they could be served in a separate resource that clients explicitly request when needed.

===7 Approach===
Our approach defines a series of transparent, interchangeable, and discoverable interface features. These features supply informative metadata and can be ignored by the client if not needed. This process is split into five sequential steps, which are studied in this work: selecting, generating, modeling, shipping, and consuming. The complete setup is illustrated in Figure 1.

====7.1 Selecting metadata====
This first step identifies candidate types to include in this work, originating from an extensive literature study (as briefly mentioned in Section 3). Good candidates are usable in the context of i) the RDF format, and ii) the Web, i.e., they are resistant to its delays, protocols, and serializations. Thus, given these restrictions, we conduct a feature-based analysis to assess existing and novel metadata techniques for query optimization. For each selected metadata type, we compose a new interface based on the Triple Pattern Fragments interface.

[Figure 1: a client (SPARQL query, query algorithm, results) and an HTTP server (Linked Data source) exchanging request fragments and response fragments with metadata; the steps are (1) Selecting, (2) Generating, (3) Modeling, (4) Shipping, and (5) Consuming.]
Fig. 1: Complete setup with 5 sequential steps that are subject to research.

As a primary focus, we selected four metadata types:

1. cardinality: the number of triples matching a triple pattern
2. membership: a compact representation of the set of matches
3. summary: a compact representation of the complete dataset
4. discovery: a set of links, i.e., hypermedia, to navigate to similar interfaces to retrieve more relevant data

However, future research may uncover new interesting types or variations that can be included in this research.

====7.2 Generating metadata====
Next, we study the methods that extract the necessary metadata. Important here is the overhead introduced on the server, which directly impacts the cost of hosting such an interface reliably. Therefore, any extraction process should minimize its average CPU usage relative to the overall CPU usage. Accordingly, we propose evaluating the following algorithms to construct specific metadata from an existing knowledge base:

1. constructing an Approximate Membership Filter (e.g., Bloom filter) from a fragment
2. profiling and summarizing an RDF dataset
3. triple pattern cardinality estimation
4. a summary index for constructing relevant hyperlinks to other interfaces.

====7.3 Modeling metadata====
To ensure the scalability of our approach in the distributed Web environment, we aim at loose coupling between client and server. Therefore, the notion of a self-descriptive interface is key. An RDF description of how the client should interpret the metadata is included in the server’s response. Specifically, this is modeled using the Hydra Core Vocabulary from the Hydra W3C Community Group, augmented with VoID, a novel vocabulary for Approximate Membership Filters (http://semweb.datasciencelab.be/ns/membership#), and an adjusted data summary vocabulary loosely based on VoID.

====7.4 Shipping metadata====
On the Web, shipping (meta)data from server to client is subject to the Web’s characteristics: the HTTP protocol, the available network bandwidth, and the resource-oriented design. Therefore, we study techniques that improve the effects of different metadata on caching and response size. The former determines how metadata should be embedded in the request-response cycle.
For example, considering a single request, is the metadata supplied in-band or as a separate resource? The latter dictates download speeds and thus requires optimal serialization or compression.

====7.5 Consuming metadata====
Finally, we introduce techniques to improve client-side query execution using the metadata provided by the interface. Improvements are made by i) lowering the number of HTTP requests required to solve a query; and ii) improving the recall of query results by applying automatic dataset discovery.

===8 Evaluation plan===
The evaluation of our approach is specific to the task that a client needs to perform, i.e., the use case. For this work, we evaluate in the context of client-side SPARQL querying, which is selected as the main use case. Therefore, we rely on a few established query mixes for SPARQL endpoint testing:

1. the Berlin benchmark [4], for fair comparison with existing single-machine systems
2. the WatDiv benchmark [2], for in-depth analysis of the performance of specific query patterns
3. the DBpedia benchmark [10], for real-world scenarios
4. the FedBench benchmark, for measuring to what extent hypermedia can increase the recall of federated queries

We implement the aforementioned interfaces by extending the existing NodeJS server (footnote 2). Also, we extend the NodeJS query client (footnote 3) to automatically discover all metadata and adjust the query execution accordingly. Next, we have built a benchmarking tool that measures the following:

1. total query execution time
2. time to first result
3. response size
4. the number of HTTP requests
5. individual and average request duration
6. CPU and memory usage on both client and server

With this tool, we run all of these query mixes in several iterations on datasets with different sizes and various HTTP cache setups. Results are compared against the TPF baseline (cardinality metadata) to assess the improvement, and against state-of-the-art SPARQL query systems.
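
As an illustration of how some of these measurements relate to each other, the following Python sketch records total execution time, time to first result, result count, and number of requests for one query run. It is a simplified assumption of what such a tool could look like, not the benchmarking tool itself: `measure_query`, `QueryMeasurement`, and the `run_query` callback are hypothetical names, and response size, request durations, and CPU and memory usage are left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class QueryMeasurement:
    total_seconds: float
    first_result_seconds: Optional[float]  # None when the query has no results
    result_count: int
    request_count: int


def measure_query(
    run_query: Callable[[Callable[[], None]], Iterable[object]],
) -> QueryMeasurement:
    """Run one query and time it.

    `run_query` is a hypothetical client hook: it receives a callback that it
    must invoke once per HTTP request, and it yields results as they arrive.
    """
    requests = 0

    def on_request() -> None:
        nonlocal requests
        requests += 1

    start = time.perf_counter()
    first: Optional[float] = None
    count = 0
    for _ in run_query(on_request):
        if first is None:
            # Time to first result matters for streaming, client-side execution.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return QueryMeasurement(total, first, count, requests)
```

Separating time to first result from total execution time is useful here because a client that streams results can deliver a first answer early even when the complete query takes longer.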
2 http://github.com/LinkedDataFragments/Server.js
3 http://github.com/LinkedDataFragments/Client.js

===9 Reflections===
The practical aspects of Linked Data querying have been understudied so far. Focus has been on query execution time, precision, and recall, while the feasibility of most SPARQL and Linked Data query approaches is questionable. A Web context introduces many important characteristics, restrictions, and opportunities, which are not mentioned or evaluated. As a result, we have not yet seen a widespread adoption of queryable Linked Data sources, or of applications that rely on them. LODstats (http://stats.lod2.eu/) counts 9960 Linked Datasets, of which only 187 endpoints exist error-free, which is only about 2%. According to the more modern SPARQLES (http://sparqles.ai.wu.ac.at/), there are currently 535 endpoints, of which only 44.67% have sufficient availability. Note that, although the number of endpoints has tripled since 2011, the availability rate has not improved. A recent count from LODLaundromat (http://lodlaundromat.org/wardrobe/) indicates that around 658,018 Linked Datasets exist, each of which is available as a Triple Pattern Fragments interface [13]. Thus, we can only conclude that the relative number of endpoints is decreasing steadily, whereas the number of Triple Pattern Fragments interfaces is keeping up. This makes research into such lightweight interfaces important. The essence is that, by enabling more nuance and demanding less from servers, you can get more done with Linked Data.

===References===
1. Acosta, M., Vidal, M.E., Lampo, T., Castillo, J., Ruckhaus, E.: ANAPSID: An adaptive query processing engine for SPARQL endpoints. In: Proceedings of the 10th International Conference on The Semantic Web. pp. 18–34. ISWC’11, Springer-Verlag, Berlin, Heidelberg (2011), http://dl.acm.org/citation.cfm?id=2063016.2063019
2. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: The Semantic Web – ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, chap. Diversified Stress Testing of RDF Data Management Systems, pp. 197–212. Springer International Publishing, Cham (2014), http://dx.doi.org/10.1007/978-3-319-11964-9_13
3. Bernstein, P.A., Goodman, N., Wong, E., Reeve, C.L., Rothnie Jr, J.B.: Query processing in a system for distributed databases (SDD-1). ACM Transactions on Database Systems (TODS) 6(4), 602–625 (1981)
4. Bizer, C., Schultz, A.: Benchmarking the performance of storage systems that expose SPARQL endpoints. World Wide Web Internet And Web Information Systems (2008)
5. Bowman, I.T.: Hybrid shipping architectures: A survey. University of Waterloo (February 2001)
6. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.Y.: SPARQL Web-querying infrastructure: Ready for action? In: 12th International Semantic Web Conference (Nov 2013)
7. Franklin, M.J., Jónsson, B.T., Kossmann, D.: Performance tradeoffs for client-server query processing. In: ACM SIGMOD Record. vol. 25, pp. 149–160. ACM (1996)
8. Görlitz, O., Staab, S.: SPLENDID: SPARQL endpoint federation exploiting VoID descriptions. In: Proceedings of the 2nd International Workshop on Consuming Linked Data. Bonn, Germany (2011), http://uni-koblenz.de/~goerlitz/publications/GoerlitzAndStaab_COLD2011.pdf
9. Hose, K., Schenkel, R.: Towards benefit-based RDF source selection for SPARQL queries. Proc. of the 4th International Workshop on Semantic Web Information Management pp. 1–8 (2012)
10. Morsey, M., Lehmann, J., Auer, S., Ngonga Ngomo, A.C.: DBpedia SPARQL benchmark – performance assessment with real queries on real data. The Semantic Web – ISWC 2011 pp. 454–469 (2011)
11. Neumann, T., Weikum, G.: x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases. In: Proceedings of the International Conference on Very Large Data Bases. vol. 3, pp. 256–263. VLDB Endowment (Sep 2010)
12. Rietveld, L., Beek, W., Schlobach, S.: LOD Lab: Experiments at LOD scale. In: The Semantic Web – ISWC 2015, pp. 339–355. Springer (2015)
13. Rietveld, L., Verborgh, R., Beek, W., Vander Sande, M., Schlobach, S.: Linked data-as-a-service: The Semantic Web redeployed. In: Proceedings of the 12th Extended Semantic Web Conference (Jun 2015), http://linkeddatafragments.org/publications/eswc2015-lodl.pdf
14. Saleem, M., Khan, Y., Hasnain, A., Ermilov, I., Ngonga Ngomo, A.C.: A fine-grained evaluation of SPARQL endpoint federation systems. Semantic Web Journal (2014), http://svn.aksw.org/papers/2014/fedeval-swj/public.pdf
15. Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: International Semantic Web Conference, pp. 245–260 (2014)
16. Vander Sande, M., Verborgh, R., Van Herwegen, J., Mannens, E., Van de Walle, R.: Opportunistic Linked Data querying through approximate membership metadata. In: Arenas, M., Corcho, O., Simperl, E., Strohmaier, M., d’Aquin, M., Srinivas, K., Groth, P., Dumontier, M., Heflin, J., Thirunarayan, K., Staab, S. (eds.) The Semantic Web – ISWC 2015. Lecture Notes in Computer Science, vol. 9366, pp. 92–110. Springer (Oct 2015), http://linkeddatafragments.org/publications/iswc2015-amf.pdf
17. Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., Cyganiak, R., Colpaert, P., Mannens, E., Van de Walle, R.: Querying datasets on the Web with high availability. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (eds.) Proceedings of the 13th International Semantic Web Conference. Lecture Notes in Computer Science, vol. 8796, pp. 180–196. Springer (Oct 2014)
18. Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P.: Triple Pattern Fragments: a low-cost knowledge graph interface for the Web. Journal of Web Semantics 37–38, 184–206 (2016), http://linkeddatafragments.org/publications/jws2016.pdf