Querying Distributed Heterogeneous Linked Data Interfaces through Reasoning Joachim Van Herwegen Supervised by Ruben Verborgh and Erik Mannens Data Science Lab (Ghent University - iMinds) Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium joachim.vanherwegen@ugent.be Abstract. Linked Data can be distributed through multiple interfaces on the Web, each of them with their own expressivity. However, there is no generic client available that can handle querying over multiple hetero- geneous interfaces. This increases the complexity of combining datasets and designing new interfaces with traditional approaches. Rule-based reasoning is going to be explored to combine different interfaces without intervention of a human developer. Using an iterative approach to extend Linked Data interfaces, I will evaluate different querying set-ups for the sparql language on performance, completeness and correctness. Prelimi- nary results look promising: preparatory research on query optimization using light-weight Linked Data interfaces has shown there is still room for optimization. I aim to design a generic Linked Data querying engine, capable of handling different interfaces, that can easily be extended. As my PhD is still in an early phase, I hope to narrow the scope in the next months, based on feedback of the Doctoral Consortium. 1 Problem Statement Nowadays there are multiple ways to access Linked Data. sparql endpoints and data dumps are the most popular methods [10] but alternative options such as Triple Pattern Fragments [30] are gaining in popularity. These options do not use different querying methods besides sparql, resulting in diverging access interfaces. Not all sparql endpoints support the full range of sparql functionality [9], meaning even a set of sparql endpoints can already have heterogeneous interfaces. This complicates querying datasets: different client implementations are necessary to query the full spectrum of available data. For my PhD, I consider using rule-based rdf reasoning to solve this problem. Reasoning is an important component of the Semantic Web: it allows us to draw conclusions from the available data. In the case of executing Linked Data queries the data is the actual query that needs to be executed, together with interme- diate results and possibly additional metadata provided by the endpoints. The conclusion is then how to answer this query using the available methods, meaning reasoning can be used to solve these queries by leveraging all its advantages such as pattern matching and ontology comprehension. 2 Joachim Van Herwegen 2 Relevancy Having heterogeneous Web apis has always been a problem of the Semantic Web [29]. The diversity in interfaces makes it hard to create the originally envisioned agents [7] capable of interfacing with this Web of knowledge. Querying Linked Data is now starting to face a similar problem with an increase of available interfaces [9,10,30]. For Web apis, attempts have been made to solve this problem by introducing generic description languages [27]. However, for Linked Data query interfaces there is no similar technology. This is not helped by the fact that new interfaces are still being developed [24, 26]. A generic, interface-independent solution to this problem is thus required so the Linked Data is not splintered over the Web. 3 Related Work To create the proposed system I will make use of the extensive amount of research that has already been done in multiple fields, which I categorize in Linked Data Fragments, sparql querying, federated sparql and rule-based reasoning. 3.1 Linked Data Fragments The main goal behind Linked Data Fragments [30] is to provide a uniform way to talk about the available Linked Data interfaces. Specifically, it suggests that there are many possible interfaces with varying complexity on either the server or client side between a simple data dump and a fully fledged sparql endpoint, thus providing a base for the work required. One additional contribution was the Triple Pattern Fragments (tpf) [30] interface. This interface aims to reduce the server load of executing sparql queries by only supporting single triple pattern requests and offloading the remaining execution complexity to the client. Several extensions have already been suggested for the tpf, two of which I contributed to [24, 26], again moving some complexity back to the server without having too much impact on the performance. This effectively increases the number of available interfaces to access Linked Data, making even more apparent the requirement for a generic method to query these. 3.2 sparql sparql [11] is the current standard for querying rdf [13] data. It allows the user to request Linked Data results through sparql endpoints. Although the sparql functionality is described in the specification, it is not fully supported by all of these endpoints [9]. Much research has already been done on how to execute queries efficiently. This can be a complex problem depending on the operators used in the query [17]. Several optimizations have already been proposed, such as focusing on rewriting Querying Heterogeneous Linked Data Interfaces 3 sparql algebra [21], ordering operations based on selectivity [23] or creating an optimization model for the entire sparql execution process [12]. For my suggested approach, I will base the query execution on this already existing research. Many of these optimizations are dataset independent and focus solely on the query composition itself, allowing for re-use in any sparql query engine. 3.3 Federated sparql Federated querying aims to tackle the problem of having to use multiple data sources to answer a (sparql) query. Two major components of this are source discovery and source selection [14]. For the work described in this paper, source selection is the most important part, on which a lot of research has already been done [18]. This is a complex issue with a lot of separate components that can be optimized [19]. 3.4 Rule-based Reasoning Notation3 (n) [6] is an extension of the rdf model. It offers several additional features such as the possibility to refer to graphs, write logical rules that can be used for reasoning, and express proofs of these reasoning results. There are several reasoners supporting n, with the main ones being cwm [5] and eye [28]. An important part of reasoning over Linked Data are the owl ontologies [4], which allow for a description of conclusions that can be drawn from Linked Data. Traditionally, the reasoning over owl ontologies is done by Description Logic based reasoners using the tableaux algorithm [22]. However, one of the owl profiles [16], owl-rl, was designed to support rule-based reasoners in coping with owl ontologies. The main reasons I choose for rule-based reasoning are performance [3] and the proof generation done by n reasoners, since I propose to base query plans on the generated proofs. 4 Research Questions My main research question is: – How can reasoning improve query plans for accessing distributed heterogeneous Linked Data interfaces? Important corresponding sub-questions are: – How can we describe sparql queries and ldf endpoints formally in rdf? – How can we execute and optimize sparql queries using these descriptions? – How can we extend this process to federated querying? 4 Joachim Van Herwegen 5 Hypotheses My hypotheses are the following: – The decision generated by the reasoner corresponds to the valid query plan and the proof is an explanation of why the plan was chosen. – sparql queries can be described with an rdf representation of the correspond- ing sparql algebra. An ontology can be made to describe the expressiveness of common Linked Data interfaces. – The proposed system will provide both complete and correct results. – Existing research in source selection for federated querying can be applied to support multiple data sources while maintaining comparable execution times to existing frameworks. 6 Preliminary results A sub-problem of querying heterogeneous interfaces is executing a sparql query on a tpf platform: multiple calls are necessary to fully execute the query and performance can differ greatly based on the chosen plan [30]. There have already been several publications providing optimized solutions for this problem [1, 25], indicating that there is definitely room for optimization. In my paper [25] I focused on how the number of requests can be optimized and how tpf data can be joined more efficiently to increase query answering performance. Many of the underlying ideas can be reused in the framework suggested here. Related to that is using reasoning to combine multiple services to produce a requested result as described for the restdesc structure [27]. This is a framework that formally describes web apis and uses reasoning to create workflows through these apis. Results there have shown that this algorithm can scale for many services and proof lengths, meaning that the proposed system could handle larger queries and datasets. 7 Approach I propose an iterative approach to tackle this problem, which will start out with a single interface such as tpf and only the base features of a sparql query, such as bgp matching. This reduces the initial complexity of the problem but still requires several necessary components to be built. sparql query description Before any reasoning can be done, the reasoner has to understand the query. This means it needs an rdf representation of the actual query. There already are ontologies supporting this, such as spin. These focus on the actual sparql representation though. I propose to work with sparql algebra instead. The algebra breaks a query down to its base components, that can then be assigned to interfaces able to solve them, and to represent that algebra in rdf. Querying Heterogeneous Linked Data Interfaces 5 Interface description To reason over heterogeneous interfaces there needs to be a way to describe their functionality: the reasoner has to understand which parts of a query this specific interface can answer. This needs to be done in such a way that the functionality can be matched with query components as described above. I am currently investigating the Hydra vocabulary which was designed to describe web services [15]. Some parts of the vocabulary are already used as metadata in tpf endpoints so those can be reused. An essential part of my research in respect to interfaces descriptions is related to the rules that combine Hydra with the sparql algebra components, indicating what the response looks like when calling a specific interface. Initial plan optimization There already exist multiple optimization techniques for sparql queries as mentioned in Section 3.2. Optimizations that focus on sparql query rewriting will be highly relevant, while those depending on local indexes on the data are less likely to be used. In the case of tpf and possibly other interfaces we are provided with additional metadata which can also influence the query plan. In the case of a single sparql endpoint the plan is rather simple: send the query to that endpoint. However, in case of multiple endpoints or more restrictive interfaces the query will need to be split up into multiple calls. Adaptive query optimization In case the query execution is split up in multiple calls, these will provide intermediate results, which can also influence the query plan. If results indicate that a component is more selective it is more interesting to focus on that one first for example. This means that the ordering of the plan is also quite important: the results of one component of the query influence the bindings for another component if they share a common variable, making that one more selective. Since it is impossible to completely determine this in advance, the system will have to adapt at runtime to those intermediate results. Additional rules will be necessary to support this, allowing the reasoner to take this information into account. Extending the system Once the system works for tpf interfaces and simple queries, it will be extended to additional interfaces and sparql operators. The challenge there will be to make sure all interface functionality can be described adequately. It is impossible to predict all the necessary components since it also needs to support future interfaces, but the system should be modular enough that it can be extended once new descriptions are required. Multiple data sources So far I have only described solutions with a single interface which require multiple requests to execute a query. In that case the single interface provides all the necessary data to solve that query. However, this is no longer the case when we use the interfaces of multiple sources. To solve this the descriptions need to be extended with additional metadata describing the contents of the endpoint where possible. Here I will make use of the already existing work in federated querying. 6 Joachim Van Herwegen 8 Evaluation Plan The proposed system will be evaluated by comparing it to existing querying technologies. Since I suggested an iterative approach, starting with a more simple system which is then extended, the evaluations will also be iterative: each new extension will require a complete evaluation to determine its influence on the complete system. I plan to make use of existing sparql benchmarks [2, 8, 20]. The evaluation will be done in multiple steps to make sure all bases are covered for having a functioning query engine. Correctness The first step of the evaluation will be verifying the correctness of the algorithm: make sure the returned results are actually correct answers to the query. Completeness The second important component to have a working sparql engine is making sure it returns all the results. This might not always be necessary, especially in the case of federated queries, but for the initial implementation on a single data source with simple queries the answer needs to be complete. Optimization Once the framework passes the two steps mentioned above, I will investigate the performance: how does it compare to existing technologies and sparql engines? In the case of a single tpf interface there are already multiple implementations available for comparison [1, 25, 30]. The case of a single sparql endpoint is trivial since that simply involves sending the query to the endpoint. Once multiple data sources are used there are already several existing implementations to compare with [18]. There are multiple features that need to be covered in performance tests, such as response time, bandwidth requirements and the influence of the number or size of datasets. 9 Reflections My first paper on query optimization for tpf [25] introduced me to the vast amount of work that has been done in the field of sparql query optimization. One problem with tpf is that since it is a new technology, new optimizations would have to be researched since it provided a novel way to execute sparql queries. It is possible that much of the original research on sparql query optimization can be reused, but this has to be investigated on a case by case basis. This inspired me to propose a framework for generic sparql query execution. There are many challenges that need to be tackled before the idea can come to fruition. One of the main hypotheses I build upon is the ability to create adequate rules and descriptions so query plans can be generated using a reasoning engine. Working with restdesc has shown me that reasoning can be quite powerful for the Semantic Web. In case I am unable to find an adequate solution, the system Querying Heterogeneous Linked Data Interfaces 7 proposed can still be built in a non-reasoning environment, although it would lose all reasoning advantages. The second big hurdle is the combination of multiple interfaces. Although source selection has already been extensively researched as shown in Section 3.3, the combination of both different sources and different interfaces provides a new level of complexity. Assuming the remainder of the framework works correctly, providing a solution will definitely be possible. However, providing an acceptably optimized solution will be much harder. It is still unclear to me how to best tackle this, which is why I look forward to discussing this during the Doctoral Consortium. References 1. Acosta, M., Vidal, M.E.: Networks of linked data eddies: An adaptive web query processing engine for rdf data. In: The Semantic Web-ISWC 2015, pp. 111–127 (2015) 2. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of rdf data management systems. In: The Semantic Web–ISWC 2014, pp. 197–212 (2014) 3. Arndt, D., De Meester, B., Bonte, P., Schaballie, J., Bhatti, J., Dereuddre, W., Verborgh, R., Ongenae, F., De Turck, F., Van de Walle, R., Mannens, E.: Ontology reasoning using rules in an eHealth context. In: Rule Technologies: Foundations, Tools, and Applications. Lecture Notes in Computer Science, vol. 9202, pp. 465–472. Springer (Jul 2015) 4. Bechhofer, S.: owl: Web ontology language. In: Encyclopedia of Database Systems, pp. 2008–2009. Springer (2009) 5. Berners-Lee, T.: cwm - a general-purpose data processor for the semantic web. Tech. rep. (2000), https://www.w3.org/2000/10/swap/doc/cwm.html 6. Berners-Lee, T., Connolly, D., Kagal, L., Scharf, Y., Hendler, J.: N3logic: A logical framework for the world wide web. CoRR abs/0711.1533 (2007) 7. Berners-Lee, T., Hendler, J., Lassila, O., et al.: The semantic web. Scientific american 284(5), 28–37 (2001) 8. Bizer, C., Schultz, A.: The Berlin sparql benchmark. International Journal on Semantic Web and Information Systems 5(2), 1–24 (2009) 9. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.Y.: sparqlWeb- querying infrastructure: Ready for action? In: Proceedings of the 12th International Semantic Web Conference (Nov 2013) 10. Ermilov, I., Martin, M., Lehmann, J., Auer, S.: Linked open data statistics: Col- lection and exploitation. In: Knowledge Engineering and the Semantic Web, pp. 242–249. Springer (2013) 11. Harris, S., Seaborne, A.: sparql . query language. Recommendation, World Wide Web Consortium (Mar 2013), http://www.w3.org/TR/sparql11-query/ 12. Hartig, O., Heese, R.: The sparql query graph model for query optimization. In: The Semantic Web: Research and Applications, pp. 564–578. Springer (2007) 13. Hayes, P., Patel-Schneider, P.: rdf 1.1 semantics. W3C recommendation, W3C (Feb 2014), http://www.w3.org/TR/2014/REC-rdf11-mt-20140225/ 14. Heath, T., Bizer, C.: Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology 1(1), 1–136 (2011) 15. Lanthaler, M., Gütl, C.: Hydra: A vocabulary for hypermedia-driven Web apis. In: Proceedings of the 6th Workshop on Linked Data on the Web (May 2013) 8 Joachim Van Herwegen 16. Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C., et al.: owl 2 web ontology language: Profiles. W3C recommendation 27, 61 (2009) 17. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of sparql. In: International semantic web conference. vol. 4273, pp. 30–43. Springer (2006) 18. Rakhmawati, N.A., Umbrich, J., Karnstedt, M., Hasnain, A., Hausenblas, M.: Querying over federated sparql endpoints - A state of the art survey. CoRR abs/1306.1723 (2013) 19. Saleem, M., Khan, Y., Hasnain, A., Ermilov, I., Ngonga Ngomo, A.C.: A fine-grained evaluation of sparql endpoint federation systems. Semantic Web (Preprint), 1–26 (2015) 20. Schmidt, M., Hornung, T., Meier, M., Pinkel, C., Lausen, G.: SP2 Bench: A sparql performance benchmark. In: Semantic Web Information Management, pp. 371–393. Springer (2010) 21. Schmidt, M., Meier, M., Lausen, G.: Foundations of sparql query optimization. In: Proceedings of the 13th International Conference on Database Theory. pp. 4–33. ACM (2010) 22. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical owl-dl reasoner. Web Semantics: science, services and agents on the World Wide Web 5(2), 51–53 (2007) 23. Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., Reynolds, D.: sparql basic graph pattern optimization using selectivity estimation. In: Proceedings of the 17th international conference on World Wide Web. pp. 595–604. ACM (2008) 24. Van Herwegen, J., De Vocht, L., Verborgh, R., Mannens, E., Van de Walle, R.: Substring filtering for low-cost Linked Data interfaces. In: The Semantic Web – ISWC 2015. Lecture Notes in Computer Science, vol. 9366, pp. 128–143. Springer (Oct 2015) 25. Van Herwegen, J., Verborgh, R., Mannens, E., Van de Walle, R.: Query execution optimization for clients of triple pattern fragments. In: Proceedings of the 12th Extended Semantic Web Conference (Jun 2015) 26. Vander Sande, M., Verborgh, R., Van Herwegen, J., Mannens, E., Van de Walle, R.: Opportunistic Linked Data querying through approximate membership metadata. In: The Semantic Web – ISWC 2015. Lecture Notes in Computer Science, vol. 9366, pp. 92–110. Springer (Oct 2015) 27. Verborgh, R., Arndt, D., Van Hoecke, S., De Roo, J., Mels, G., Steiner, T., Gabarró Vallés, J.: The pragmatic proof: Hypermedia API composition and execu- tion. Theory and Practice of Logic Programming (2016) 28. Verborgh, R., De Roo, J.: Drawing conclusions from Linked Data on the Web. IEEE Software 32(5), 23–27 (May 2015) 29. Verborgh, R., Mannens, E., Van de Walle, R.: Bottom-up web apis with self- descriptive responses. In: Proceedings of the First Karlsruhe Service Summit Re- search Workshop (Feb 2015) 30. Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P.: Triple Pattern Fragments: a low- cost knowledge graph interface for the Web. Journal of Web Semantics 37–38, 184–206 (2016)