Optimizing Federated Queries Based on the Physical Design of a Data Lake Philipp D. Rohde Maria-Esther Vidal TIB Leibniz Information Centre for Science and TIB Leibniz Information Centre for Science and Technology Technology Hannover, Germany Hannover, Germany philipp.rohde@tib.eu maria.vidal@tib.eu ABSTRACT Motivation. Considering the query in Figure 1a two different The optimization of query execution plans is known to be crucial query execution plans (QEP) can be generated. On the one hand, for reducing the query execution time. In particular, query op- the QEP in Figure 1b is unaware of the physical design. Therefore, timization has been studied thoroughly for relational databases as many operations as possible are performed at the level of the over the past decades. Recently, the Resource Description Frame- query engine. On the other hand, the QEP in Figure 1c is aware work (RDF) became popular for publishing data on the Web. As of the physical design. Hence, as many operations as possible are a consequence, federations composed of different data models pushed to the data sources. In the example query, the information like RDF and relational databases evolved. One type of these about genes and diseases is from the Diseasome data set and federations are Semantic Data Lakes where every data source is stored in a single source. Therefore, the join can be pushed down. kept in its original data model and semantically annotated with The filter expression for the scientific name of the species in ontologies or controlled vocabularies. However, state-of-the-art the Affymetrix data set is always performed at the query engine query engines for federated query processing over Semantic Data because it is not indexed. No index is created since there are Lakes often rely on optimization techniques tailored for RDF. In values that are present in more than 15% of the records. this paper, we present query optimization techniques guided This paper is organized as follows. Preliminary concepts and by heuristics that take the physical design of a Data Lake into heuristics for optimizing federated queries are discussed in Sec- account. The heuristics are implemented on top of Ontario, a tion 2. Section 3 provides a preliminary analysis. Related work is SPARQL query engine for Semantic Data Lakes. Using source- presented in Section 4. We conclude in Section 5. specific heuristics, the query engine is able to generate more effi- cient query execution plans by exploiting the knowledge about 2 OUR APPROACH indexes and normalization in relational databases. We show that heuristics which take the physical design of the Data Lake into 2.1 Preliminaries account are able to speed up query processing. In this section, we present basic concepts required to understand this work. SPARQL queries can be partitioned into groups of 1 INTRODUCTION acyclic patterns that share exactly one variable [22]. In the com- Advances in the technologies for data generation and ingestion mon case they represent a class of instances that share the same facilitate the collection of large volumes of data from where properties. Decomposing a query based on these groups leads valuable knowledge can be extracted. However, the wide variety to a QEP with star-shaped sub-queries (SSQ) in the leaves. In the of formats and data management systems available for storing motivating example (cf. Figure 1a) the SSQs are indicated with and processing the collected data, hamper interoperability and colored brackets. Following the idea of star-shaped groups over data integration. The problem of integrating data collected from subjects RDF Molecule Templates (RDF-MT) [4] are an abstract different data sources has been extensively treated in the liter- description of the properties of the entities in an RDF data set. ature [2, 11]; the mediator and wrapper architecture proposed Each RDF-MT represents one class of instances, e.g., drugs from by Wiederhold [23] and the data integration system approach Diseasome or genes from TCGA. A Data Lake is a collection of presented by Lenzerini [17], represent the basis for the state heterogeneous data sets. The data sets do not necessarily share of the art in data integration [7, 16, 19] and query processing the same data model. If data models that do not have semantics by over heterogeneous data sets or polystores [3, 5, 8, 15, 20]. Albeit nature, e.g., relational databases, are annotated with semantics, the rich variety of solutions, the problem of efficiently query- the collection of data sets is called Semantic Data Lake. RDF and ing heterogeneous data srouces remains still open because data relational databases are amongst the most frequent data models sources may differ in many various parameters, e.g., the physcial present in Semantic Data Lakes. In our case, the query engine implementation of the databases that store the data. In order to receives a SPARQL query and translates sub-queries to the native effectively solve interoperability and take advantage of the huge query language, e.g., SQL, of the data source. amount of available data, novel query processing solutions able to exploit not only logical characteristics of the data but also their 2.2 Source-Specific Heuristics physical representation, are demanded. One problem to tackle during query processing over a Data Lake is the variety of data models used throughout the Data Lake. State-of-the-art query engines use generalized optimization tech- niques or rely on heuristics tailored for one specific data model. © 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed- ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Hence, they lose further opportunities for improving the query Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- performance. In order to enable the maximal capability of op- tribution 4.0 International (CC BY 4.0) timizing the query execution plans, the physical design of the (a) SPARQL Query (b) QEP without using indexes (c) QEP using indexes Figure 1: Query execution plans (QEP) for the same query (a); not considering indexes (b) and considering indexes (c). Optimizing QEPs with respect to the physical design of the Data Lake allows to find more efficient plans with fewer operations needed to be performed at the query engine level. Data Lake needs to be considered. This includes optimizing sub- speeding up the filter evaluation in case of fast networks at the queries for the different data models present in the Data Lake. cost of transferring larger intermediate results. We propose two heuristics designed for relational databases to The proposed heuristics follow common knowledge about show the impact of respecting the physical design. The proposed relational databases and network delays. Relational databases heuristics assume that the relational tables are normalized in are designed to find effective and efficient query execution plans 3NF. Further we expect that the subjects of a SPARQL query for joins and filter expressions exploiting indexes if beneficial. are modeled as the primary keys of the tables. Jozashoori and Even if the execution time at one source is increased by combin- Vidal [14] showed that this is the best case scenario for running ing sub-queries into one, the overall query performance might star-shaped sub-queries against relational databases. be improved in the case of a slow network by reducing the in- termediate result. Hence, the heuristics are very well suited to Heuristic 1 (Pushing down joins). Given two star-shaped sub- investigate the impact of considering the physical design of the queries over the same RDB endpoint, combine those sub-queries Data Lake during the query optimization. into one sub-query if the join attribute is indexed. Heuristic 1 is proposed since joins over indexed attributes 3 EXPERIMENT in relational databases are considered to be fast as long as the We empirically study two different kinds of query execution plans number of joins is kept reasonable. In the data we are working in order to evaluate the proposed heuristics. The QEPs are as on, star-shaped sub-queries are usually represented by less then follows: a) Physical-Design-Unaware QEP: A QEP not respecting four relational tables. Assuming the worst case for a star-shaped the physical design of the Data Lake and, therefore, not using the sub-query, three relational tables contribute to the answer. Two generated indexes to optimize the query execution. b) Physical- of those tables are connected with the remaining table via foreign Design-Aware QEP: A QEP that considers the indexes present in keys. Therefore, joining two star-shaped sub-queries leads to a the relational database. The source code and the data used for six-way join in the worst case. Hence, the number of joins can be the experiment as well as the results are available at GitHub1 . considered as reasonable. In order to decide if the number of joins Data Sets. In this experiment, we use the data sets from the is kept reasonable, a later version should consider the number LSLOD benchmark [12] which is composed of ten real-world of relational tables involved. Not only the join performance of data sets from the life sciences domain of the Linked Open Data RDB engines justifies this heuristic but also the possibility of a (LOD) cloud. The RDF version of each data set is transformed into reduced size of the intermediate result. Heuristic 1 improves the relational tables. These tables are then normalized to 3NF. Indexes query performance by reducing the time needed to perform the are created for the primary keys. Furthermore, additional indexes join as well as possibly decreasing the intermediate result. for some attributes that are used for joins or selections in the Heuristic 2 (Pushing up instantiations). Given a star-shaped queries used are generated to evaluate the impact of the proposed sub-query over a relational database, perform filters on query en- heuristics. The data of each LSLOD data set are uploaded into a gine level unless there is an index on the filtered attribute and the dedicated MySQL 5.7.24 Docker container. network speed is low. Queries. The queries provided for the LSLOD benchmark From our experience filtering string data at the query engine do not contain the possibility of pushing down the join of two performs faster compared to executing the filters in the rela- star-shaped sub-queries. Therefore, we do not use the provided tional database. Therefore, Heuristic 2 is expected to speed up queries and created five queries tailored for the heuristics to the query execution, even though a larger intermediate result show their impact on query performance. The following parame- has to be transferred to the query engine. However, if the net- ters were considered during the development of the new queries: work speed is low, the intermediate result has to be minimized a) the selectivity of the query, b) filter expressions over indexed and, therefore, the instantiation is performed at the relational database. Heuristic 2 leads to faster query execution through 1 https://github.com/SDM-TIB/Ontario-SEAData2020 (a) Physical-Design-Unaware QEPs (b) Physical-Design-Aware QEPs (c) both QEPs Figure 2: Answer Traces for Q3 with no delay and three different delays according to the gamma distributions. Answer traces show the generation of answers over time (in seconds). (a) Physical-Design-Unaware QEP not using indexes; (b) Physical-Design-Aware QEP using indexes whenever possible; and (c) both QEPs in comparison. Slow networks have a higher impact on physical-design-unaware QEPs. Respecting the physical design improves query performance. attributes, and c) possible joins of star-shaped sub-queries over in- SQL queries is not optimized for combining star-shaped sub- dexed attributes. Another factor that impacts on the performance queries. This leads to an increase in the query execution time if of a query is the size of the intermediate result. the join is pushed down. Forcing Ontario to send the optimized SQL query for Q2 approx. halves the execution time compared Network Simulation. We used four different network set- to the physical-design-unaware QEP. tings which simulate the following networks: a) No Delay: perfect Even though Heuristic 2 seems to be correct from our expe- network with no or negligible latency. b) Gamma 1: fast network rience, a deeper study on the difference of the filter execution with a gamma distribution (𝛼 = 1, 𝛽 = 0.3) of response latency performance between relational database and query engine is resulting in an average latency of 0.3 milliseconds. c) Gamma 2: needed. On the one hand, the results of Q1 support our experi- medium fast network with an average latency of 3 millisecons ence and suggest to follow Heuristic 2. On the other hand, the resulting from a gamma distribution (𝛼 = 3, 𝛽 = 1). d) Gamma 3: results of Q3 suggest otherwise. Figure 2 shows the answer gen- slow network with a gamma distribution (𝛼 = 3, 𝛽 = 1.5) leading eration for Q3 over time. It can be seen that executing the filter to an average latency of 4.5 milliseconds per message. at the relational database (physical-design-aware QEP) is faster for this query. Therefore, more studies on the filter execution Setup. For the purpose of the experiment, Ontario [5] was need to be done. Additionally, the experiment shows that the modified to run physical-design-aware QEPs and physical-design- proposed heuristics are impacted by the implementation of other unaware QEPs. Network delays are simulated within the SQL optimizations that are performed by Ontario. The impact of the wrapper of Ontario; delaying the retrieval of the next answer heuristics on the query performance is not only influenced by the from the source. The duration of the delay is calculated using physical design of the Data Lake and the network conditions but the numpy.random.gamma() function and the delay is produced also by the implementation of the query engine and wrappers. using the Python time.sleep() function. Like the data sources, Ontario is running in a Docker container. All Docker containers 4 RELATED WORK were running at the same server. Hence, network costs other Federated Databases. The problem of integrating data from than introduced by the network simulation can be neglected. dissimilar data sources has been addressed in the literature by The experiments were executed on an Ubuntu 16.04.6 LTS 64 implementing the mediator and wrapper architecture proposed bit machine with two Intel(R) Xeon(R) Platinum 8160 2.10 GHz by Wiederhold [23]. Several federated query engines have been CPUs, and 755 GiB DDR4 RAM. defined in the context of relational database [6, 10, 13, 24], as well Preliminary Results. The experiment conducts of eight dif- as diverse of integration frameworks [9]. We focus mainly on ferent configurations in total, i.e., both QEP types are evaluated approaches that implement strategies to address the problem of using all four simulated network conditions. In doing so, we en- source selection and decomposition of SPARQL queries, although, able analyzing the impact of different network conditions and we recognize the tremendous advance that the Database commu- not only the impact of physical-design-aware execution plans. nity has done to the general problem of data integration in the The analysis shows that the impact of network delays is higher last fifteen years. Existing approaches are grouped according to in the case of physical-design-unaware query execution plans. the amount of knowledge that describes the data sources, and that An analysis of the results suggests that the proposed heuristics is exploited during source selection and query decomposition to have potential to improving the query performance. However, enhance the quality of the generated query decompositions. the heuristics need to be evaluated more thoroughly and revised. Federations of RDF. With the rise of the Resource Descrip- The evaluation of Heuristic 1 is currently limited due to the query tion Framework (RDF) new federated query engines were pro- translation of Ontario. The translation of SPARQL queries into posed to optimize query processing over the new data model. FedX [21] is one of those query engines. FedX aims at minimizing REFERENCES the number of requests to be sent to the sources by identifying [1] Maribel Acosta, Maria-Esther Vidal, Tomas Lampo, Julio Castillo, and Edna groups of triple patterns that can be exclusively evaluated by Ruckhaus. 2011. ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints. In The Semantic Web – ISWC 2011. ISWC 2011. Lecture Notes in a single endpoint. ANAPSID [1] stores a list of predicates that Computer Science, Vol. 7031. Springer, Berlin, Heidelberg, 18–34. each endpoint is able to answer. Queries are decomposed into [2] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data Integration. Morgan Kaufmann. star-shaped sub-queries. ANAPSID introduces adaptive phys- [3] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill ical operators to generate results as soon as they arrive from Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan the sources. These operators perform better than the traditional Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Rec. 44, 2 (Aug. 2015), 11–16. blocking operators. MULDER [4] is based on ANAPSID and de- [4] Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, scribes the sources in terms of RDF Molecule Templates (RDF- Maria-Esther Vidal, and Sören Auer. 2018. Querying Interlinked Data by MTs).MULDER is able to reduce the query execution time and Bridging RDF Molecule Templates. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science, increase answer completeness by using semantics in the source Vol. 11310. Springer, Berlin, Heidelberg, 1–42. descriptions during decomposition and source selection. Query [5] Kemele M. Endris, Philipp D. Rohde, Maria-Esther Vidal, and Sören Auer. 2019. Ontario: Federated Query Processing against a Semantic Data Lake. engines for federations of RDF sources can benefit from the se- In Database and Expert Systems Applications. DEXA2019. Lecture Notes in mantics of the metadata and received data. Computer Science, Vol. 11706. Springer, Cham, 379–395. [6] Daniela Florescu, Alon Levy, and Alberto Mendelzon. 1998. Database Tech- Polystores. More recently, the research focus shifted towards niques for the World-Wide Web: A Survey. SIGMOD Rec. 27, 3 (Sept. 1998), 59–74. query processing against heterogeneous data sources. Differ- [7] Behzad Golshan, Alon Y. Halevy, George A. Mihaila, and Wang-Chiew Tan. ent approaches have been proposed on how to store, integrate, 2017. Data Integration: After the Teenage Years. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, and query heterogeneous federations. SeBiDA [18] is a proof- PODS 2017, Chicago, IL, USA, May 14-19, 2017. 101–106. of-concept for a semantified big data architecture. Data sets are [8] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An Intelli- differentiated in semantic, annotated with semantics, and non- gent Data Lake System. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June semantic. The latter can optionally be lifted with semantics if 26 - July 01, 2016. 2097–2100. mappings are provided. SeBiDA uses Apache Spark to reformat [9] Alon Halevy, Anand Rajaraman, and Joann Ordille. 2006. Data Integration: the data according to classes. The data is reformatted in Apache The Teenage Years (VLDB ’06). VLDB Endowment, 9–16. [10] Alon Y. Halevy. 2001. Answering Queries Using Views: A Survey. The VLDB Parquet tables. Therefore, the data is integrated in a centralized Journal 10, 4 (Dec. 2001), 270–294. or clustered manner and can be queried using SQL. Contrary [11] Alon Y. Halevy, Anand Rajaraman, and Joann J. Ordille. 2006. Data Integration: The Teenage Years. In Proceedings of the 32nd International Conference on Very to SeBiDA, PolyWeb [15] and Ontario [5] keep the data sources Large Data Bases, Seoul, Korea, September 12-15, 2006. 9–16. in their original data model. Data sources are queried in their [12] Ali Hasnain, Qaiser Mehmood, Syeda Sana e Zainab, Muhammad Saleem, native query language while the user sends SPARQL queries to Claude Warren, Durre Zehra, Stefan Decker, and Dietrich Rebholz-Schuhmann. 2017. BioFed: federated query processing over life sciences linked open data. the query engine. PolyWeb uses the same cost-based model as Journal of Biomedical Semantics 8, 1 (15 Mar 2017), 13:1–13:19. FedX does and predicate-based join groups to reduce the number [13] Zachary G. Ives, Alon Y. Halevy, Peter Mork, and Igor Tatarinov. 2004. Piazza: of local joins. Other than PolyWeb, Ontario is based on MULDER mediation and integration infrastructure for Semantic Web data. J. Web Semant. 1 (2004), 155–175. and uses the same plan generator extended with heuristics for [14] Samaneh Jozashoori and Maria-Esther Vidal. 2019. MapSDI: A Scaled-Up better optimization potential. Ontario also uses the physical oper- Semantic Data Integration Framework for Knowledge Graph Creation. In On the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019. ators of ANAPSID. Several query processing engines have been Lecture Notes in Computer Science, Vol. 11877. Springer, Cham, 58–75. proposed, but most of them focus on a single data model for query [15] Yasar Khan, Antoine Zimmermann, Alokkumar Jha, Vijay Gadepally, Mathieu optimization and therefore miss optimization opportunities. D’Aquin, and Ratnesh Sahay. 2019. One Size Does Not Fit All: Querying Web Polystores. IEEE Access 7 (01 2019), 9598–9617. [16] Craig A. Knoblock and Pedro A. Szekely. 2015. Exploiting Semantics for Big 5 CONCLUSION Data Integration. AI Magazine 36, 1 (2015), 25–38. [17] Maurizio Lenzerini. 2002. Data Integration: A Theoretical Perspective. In In this paper we present two rather simple heuristics that aim at Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA. 233–246. improving the query performance by considering the physical [18] Mohamed Nadjib Mami, Simon Scerri, Sören Auer, and Maria-Esther Vidal. design of the Data Lake compared to state-of-the-art query exe- 2016. Towards Semantification of Big Data Technology. In Big Data Analytics cution plans. Our heuristics take the a) presence of indexes, and and Knowledge Discovery. DaWaK 2016. Lecture Notes in Computer Science, Vol. 9829. Springer, Cham, 376–390. b) network condition into account. Even though the heuristics [19] Michalis Mountantonakis and Yannis Tzitzikas. 2019. Large-scale Semantic and their implementation are in an early stage, we can conclude Integration of Linked Data: A Survey. ACM Comput. Surv. 52, 5, Article 103 that the query performance in a Data Lake can be improved when (Sept. 2019), 40 pages. [20] Christoph Quix, Rihan Hai, and Ivan Vatov. 2016. GEMMS: A Generic and considering the characteristics of each data model. In future work, Extensible Metadata Management System for Data Lakes. In 28th International we plan to overcome the described limitations, e.g., improving Conference on Advanced Information Systems Engineering (CAiSE 2016). 129– 136. the quality of the translated SQL queries. Furthermore, we will [21] Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael investigate the performance of different implementations of rela- Schmidt. 2011. FedX: Optimization Techniques for Federated Query Pro- tional databases in order to gain a deeper understanding of why cessing on Linked Data. In The Semantic Web – ISWC 2011. ISWC 2011. Lecture Notes in Computer Science, Vol. 7031. Springer, Berlin, Heidelberg, 601–616. filter expressions seem to perform better at query engine level in [22] María-Esther Vidal, Edna Ruckhaus, Tomas Lampo, Amadís Martínez, Javier most cases even though the intermediate results are larger in that Sierra, and Axel Polleres. 2010. Efficiently Joining Group Patterns in SPARQL case. Additionally, studying different kinds of query decompo- Queries. In The Semantic Web: Research and Applications, Lora Aroyo, Grig- oris Antoniou, Eero Hyvönen, Annette ten Teije, Heiner Stuckenschmidt, sition (e.g., triple-based instead of star-shaped sub-queries) and Liliana Cabral, and Tania Tudorache (Eds.). Springer Berlin Heidelberg, Berlin, not normalized tables is part of our plans. Heidelberg, 228–242. [23] Gio Wiederhold. 1992. Mediators in the Architecture of Future Information Systems. IEEE Computer 25, 3 (March 1992), 38–49. ACKNOWLEDGMENTS [24] Vladimir Zadorozhny, Louiqa Raschid, Maria Esther Vidal, Tolga Urhan, and Laura Bright. 2002. Efficient Evaluation of Queries in a Mediator for Web- This work has been partially supported by the EU H2020 RIA Sources (SIGMOD ’02). Association for Computing Machinery, New York, NY, funded projects QualiChain (No 822404) and iASiS (No 727658). USA, 85–96.