=Paper= {{Paper |id=Vol-2578/SEAData6 |storemode=property |title=Optimizing Federated Queries Based on the Physical Design of a Data Lake |pdfUrl=https://ceur-ws.org/Vol-2578/SEAData6.pdf |volume=Vol-2578 |authors=Philipp D. Rohde,Maria-Esther Vidal |dblpUrl=https://dblp.org/rec/conf/edbt/RohdeV20 }} ==Optimizing Federated Queries Based on the Physical Design of a Data Lake== https://ceur-ws.org/Vol-2578/SEAData6.pdf
    Optimizing Federated Queries Based on the Physical Design
                         of a Data Lake
                              Philipp D. Rohde                                                          Maria-Esther Vidal
          TIB Leibniz Information Centre for Science and                                  TIB Leibniz Information Centre for Science and
                            Technology                                                                     Technology
                        Hannover, Germany                                                               Hannover, Germany
                       philipp.rohde@tib.eu                                                             maria.vidal@tib.eu

ABSTRACT                                                                                  Motivation. Considering the query in Figure 1a two different
The optimization of query execution plans is known to be crucial                       query execution plans (QEP) can be generated. On the one hand,
for reducing the query execution time. In particular, query op-                        the QEP in Figure 1b is unaware of the physical design. Therefore,
timization has been studied thoroughly for relational databases                        as many operations as possible are performed at the level of the
over the past decades. Recently, the Resource Description Frame-                       query engine. On the other hand, the QEP in Figure 1c is aware
work (RDF) became popular for publishing data on the Web. As                           of the physical design. Hence, as many operations as possible are
a consequence, federations composed of different data models                           pushed to the data sources. In the example query, the information
like RDF and relational databases evolved. One type of these                           about genes and diseases is from the Diseasome data set and
federations are Semantic Data Lakes where every data source is                         stored in a single source. Therefore, the join can be pushed down.
kept in its original data model and semantically annotated with                        The filter expression for the scientific name of the species in
ontologies or controlled vocabularies. However, state-of-the-art                       the Affymetrix data set is always performed at the query engine
query engines for federated query processing over Semantic Data                        because it is not indexed. No index is created since there are
Lakes often rely on optimization techniques tailored for RDF. In                       values that are present in more than 15% of the records.
this paper, we present query optimization techniques guided                               This paper is organized as follows. Preliminary concepts and
by heuristics that take the physical design of a Data Lake into                        heuristics for optimizing federated queries are discussed in Sec-
account. The heuristics are implemented on top of Ontario, a                           tion 2. Section 3 provides a preliminary analysis. Related work is
SPARQL query engine for Semantic Data Lakes. Using source-                             presented in Section 4. We conclude in Section 5.
specific heuristics, the query engine is able to generate more effi-
cient query execution plans by exploiting the knowledge about
                                                                                       2 OUR APPROACH
indexes and normalization in relational databases. We show that
heuristics which take the physical design of the Data Lake into                        2.1 Preliminaries
account are able to speed up query processing.                                         In this section, we present basic concepts required to understand
                                                                                       this work. SPARQL queries can be partitioned into groups of
1    INTRODUCTION                                                                      acyclic patterns that share exactly one variable [22]. In the com-
Advances in the technologies for data generation and ingestion                         mon case they represent a class of instances that share the same
facilitate the collection of large volumes of data from where                          properties. Decomposing a query based on these groups leads
valuable knowledge can be extracted. However, the wide variety                         to a QEP with star-shaped sub-queries (SSQ) in the leaves. In the
of formats and data management systems available for storing                           motivating example (cf. Figure 1a) the SSQs are indicated with
and processing the collected data, hamper interoperability and                         colored brackets. Following the idea of star-shaped groups over
data integration. The problem of integrating data collected from                       subjects RDF Molecule Templates (RDF-MT) [4] are an abstract
different data sources has been extensively treated in the liter-                      description of the properties of the entities in an RDF data set.
ature [2, 11]; the mediator and wrapper architecture proposed                          Each RDF-MT represents one class of instances, e.g., drugs from
by Wiederhold [23] and the data integration system approach                            Diseasome or genes from TCGA. A Data Lake is a collection of
presented by Lenzerini [17], represent the basis for the state                         heterogeneous data sets. The data sets do not necessarily share
of the art in data integration [7, 16, 19] and query processing                        the same data model. If data models that do not have semantics by
over heterogeneous data sets or polystores [3, 5, 8, 15, 20]. Albeit                   nature, e.g., relational databases, are annotated with semantics,
the rich variety of solutions, the problem of efficiently query-                       the collection of data sets is called Semantic Data Lake. RDF and
ing heterogeneous data srouces remains still open because data                         relational databases are amongst the most frequent data models
sources may differ in many various parameters, e.g., the physcial                      present in Semantic Data Lakes. In our case, the query engine
implementation of the databases that store the data. In order to                       receives a SPARQL query and translates sub-queries to the native
effectively solve interoperability and take advantage of the huge                      query language, e.g., SQL, of the data source.
amount of available data, novel query processing solutions able
to exploit not only logical characteristics of the data but also their                 2.2    Source-Specific Heuristics
physical representation, are demanded.
                                                                                       One problem to tackle during query processing over a Data Lake
                                                                                       is the variety of data models used throughout the Data Lake.
                                                                                       State-of-the-art query engines use generalized optimization tech-
                                                                                       niques or rely on heuristics tailored for one specific data model.
© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceed-
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,       Hence, they lose further opportunities for improving the query
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-              performance. In order to enable the maximal capability of op-
tribution 4.0 International (CC BY 4.0)
                                                                                       timizing the query execution plans, the physical design of the
                   (a) SPARQL Query                              (b) QEP without using indexes                   (c) QEP using indexes

Figure 1: Query execution plans (QEP) for the same query (a); not considering indexes (b) and considering indexes (c).
Optimizing QEPs with respect to the physical design of the Data Lake allows to find more efficient plans with fewer
operations needed to be performed at the query engine level.


Data Lake needs to be considered. This includes optimizing sub-           speeding up the filter evaluation in case of fast networks at the
queries for the different data models present in the Data Lake.           cost of transferring larger intermediate results.
We propose two heuristics designed for relational databases to               The proposed heuristics follow common knowledge about
show the impact of respecting the physical design. The proposed           relational databases and network delays. Relational databases
heuristics assume that the relational tables are normalized in            are designed to find effective and efficient query execution plans
3NF. Further we expect that the subjects of a SPARQL query                for joins and filter expressions exploiting indexes if beneficial.
are modeled as the primary keys of the tables. Jozashoori and             Even if the execution time at one source is increased by combin-
Vidal [14] showed that this is the best case scenario for running         ing sub-queries into one, the overall query performance might
star-shaped sub-queries against relational databases.                     be improved in the case of a slow network by reducing the in-
                                                                          termediate result. Hence, the heuristics are very well suited to
Heuristic 1 (Pushing down joins). Given two star-shaped sub-              investigate the impact of considering the physical design of the
queries over the same RDB endpoint, combine those sub-queries             Data Lake during the query optimization.
into one sub-query if the join attribute is indexed.
   Heuristic 1 is proposed since joins over indexed attributes            3    EXPERIMENT
in relational databases are considered to be fast as long as the          We empirically study two different kinds of query execution plans
number of joins is kept reasonable. In the data we are working            in order to evaluate the proposed heuristics. The QEPs are as
on, star-shaped sub-queries are usually represented by less then          follows: a) Physical-Design-Unaware QEP: A QEP not respecting
four relational tables. Assuming the worst case for a star-shaped         the physical design of the Data Lake and, therefore, not using the
sub-query, three relational tables contribute to the answer. Two          generated indexes to optimize the query execution. b) Physical-
of those tables are connected with the remaining table via foreign        Design-Aware QEP: A QEP that considers the indexes present in
keys. Therefore, joining two star-shaped sub-queries leads to a           the relational database. The source code and the data used for
six-way join in the worst case. Hence, the number of joins can be         the experiment as well as the results are available at GitHub1 .
considered as reasonable. In order to decide if the number of joins
                                                                             Data Sets. In this experiment, we use the data sets from the
is kept reasonable, a later version should consider the number
                                                                          LSLOD benchmark [12] which is composed of ten real-world
of relational tables involved. Not only the join performance of
                                                                          data sets from the life sciences domain of the Linked Open Data
RDB engines justifies this heuristic but also the possibility of a
                                                                          (LOD) cloud. The RDF version of each data set is transformed into
reduced size of the intermediate result. Heuristic 1 improves the
                                                                          relational tables. These tables are then normalized to 3NF. Indexes
query performance by reducing the time needed to perform the
                                                                          are created for the primary keys. Furthermore, additional indexes
join as well as possibly decreasing the intermediate result.
                                                                          for some attributes that are used for joins or selections in the
Heuristic 2 (Pushing up instantiations). Given a star-shaped              queries used are generated to evaluate the impact of the proposed
sub-query over a relational database, perform filters on query en-        heuristics. The data of each LSLOD data set are uploaded into a
gine level unless there is an index on the filtered attribute and the     dedicated MySQL 5.7.24 Docker container.
network speed is low.                                                        Queries. The queries provided for the LSLOD benchmark
   From our experience filtering string data at the query engine          do not contain the possibility of pushing down the join of two
performs faster compared to executing the filters in the rela-            star-shaped sub-queries. Therefore, we do not use the provided
tional database. Therefore, Heuristic 2 is expected to speed up           queries and created five queries tailored for the heuristics to
the query execution, even though a larger intermediate result             show their impact on query performance. The following parame-
has to be transferred to the query engine. However, if the net-           ters were considered during the development of the new queries:
work speed is low, the intermediate result has to be minimized            a) the selectivity of the query, b) filter expressions over indexed
and, therefore, the instantiation is performed at the relational
database. Heuristic 2 leads to faster query execution through                 1 https://github.com/SDM-TIB/Ontario-SEAData2020
    (a) Physical-Design-Unaware QEPs                  (b) Physical-Design-Aware QEPs                          (c) both QEPs

Figure 2: Answer Traces for Q3 with no delay and three different delays according to the gamma distributions. Answer
traces show the generation of answers over time (in seconds). (a) Physical-Design-Unaware QEP not using indexes; (b)
Physical-Design-Aware QEP using indexes whenever possible; and (c) both QEPs in comparison. Slow networks have a
higher impact on physical-design-unaware QEPs. Respecting the physical design improves query performance.


attributes, and c) possible joins of star-shaped sub-queries over in-   SQL queries is not optimized for combining star-shaped sub-
dexed attributes. Another factor that impacts on the performance        queries. This leads to an increase in the query execution time if
of a query is the size of the intermediate result.                      the join is pushed down. Forcing Ontario to send the optimized
                                                                        SQL query for Q2 approx. halves the execution time compared
   Network Simulation. We used four different network set-              to the physical-design-unaware QEP.
tings which simulate the following networks: a) No Delay: perfect          Even though Heuristic 2 seems to be correct from our expe-
network with no or negligible latency. b) Gamma 1: fast network         rience, a deeper study on the difference of the filter execution
with a gamma distribution (𝛼 = 1, 𝛽 = 0.3) of response latency          performance between relational database and query engine is
resulting in an average latency of 0.3 milliseconds. c) Gamma 2:        needed. On the one hand, the results of Q1 support our experi-
medium fast network with an average latency of 3 millisecons            ence and suggest to follow Heuristic 2. On the other hand, the
resulting from a gamma distribution (𝛼 = 3, 𝛽 = 1). d) Gamma 3:         results of Q3 suggest otherwise. Figure 2 shows the answer gen-
slow network with a gamma distribution (𝛼 = 3, 𝛽 = 1.5) leading         eration for Q3 over time. It can be seen that executing the filter
to an average latency of 4.5 milliseconds per message.                  at the relational database (physical-design-aware QEP) is faster
                                                                        for this query. Therefore, more studies on the filter execution
   Setup. For the purpose of the experiment, Ontario [5] was            need to be done. Additionally, the experiment shows that the
modified to run physical-design-aware QEPs and physical-design-         proposed heuristics are impacted by the implementation of other
unaware QEPs. Network delays are simulated within the SQL               optimizations that are performed by Ontario. The impact of the
wrapper of Ontario; delaying the retrieval of the next answer           heuristics on the query performance is not only influenced by the
from the source. The duration of the delay is calculated using          physical design of the Data Lake and the network conditions but
the numpy.random.gamma() function and the delay is produced             also by the implementation of the query engine and wrappers.
using the Python time.sleep() function. Like the data sources,
Ontario is running in a Docker container. All Docker containers         4    RELATED WORK
were running at the same server. Hence, network costs other
                                                                           Federated Databases. The problem of integrating data from
than introduced by the network simulation can be neglected.
                                                                        dissimilar data sources has been addressed in the literature by
The experiments were executed on an Ubuntu 16.04.6 LTS 64
                                                                        implementing the mediator and wrapper architecture proposed
bit machine with two Intel(R) Xeon(R) Platinum 8160 2.10 GHz
                                                                        by Wiederhold [23]. Several federated query engines have been
CPUs, and 755 GiB DDR4 RAM.
                                                                        defined in the context of relational database [6, 10, 13, 24], as well
   Preliminary Results. The experiment conducts of eight dif-           as diverse of integration frameworks [9]. We focus mainly on
ferent configurations in total, i.e., both QEP types are evaluated      approaches that implement strategies to address the problem of
using all four simulated network conditions. In doing so, we en-        source selection and decomposition of SPARQL queries, although,
able analyzing the impact of different network conditions and           we recognize the tremendous advance that the Database commu-
not only the impact of physical-design-aware execution plans.           nity has done to the general problem of data integration in the
The analysis shows that the impact of network delays is higher          last fifteen years. Existing approaches are grouped according to
in the case of physical-design-unaware query execution plans.           the amount of knowledge that describes the data sources, and that
An analysis of the results suggests that the proposed heuristics        is exploited during source selection and query decomposition to
have potential to improving the query performance. However,             enhance the quality of the generated query decompositions.
the heuristics need to be evaluated more thoroughly and revised.           Federations of RDF. With the rise of the Resource Descrip-
The evaluation of Heuristic 1 is currently limited due to the query     tion Framework (RDF) new federated query engines were pro-
translation of Ontario. The translation of SPARQL queries into          posed to optimize query processing over the new data model.
FedX [21] is one of those query engines. FedX aims at minimizing      REFERENCES
the number of requests to be sent to the sources by identifying        [1] Maribel Acosta, Maria-Esther Vidal, Tomas Lampo, Julio Castillo, and Edna
groups of triple patterns that can be exclusively evaluated by             Ruckhaus. 2011. ANAPSID: An Adaptive Query Processing Engine for SPARQL
                                                                           Endpoints. In The Semantic Web – ISWC 2011. ISWC 2011. Lecture Notes in
a single endpoint. ANAPSID [1] stores a list of predicates that            Computer Science, Vol. 7031. Springer, Berlin, Heidelberg, 18–34.
each endpoint is able to answer. Queries are decomposed into           [2] AnHai Doan, Alon Y. Halevy, and Zachary G. Ives. 2012. Principles of Data
                                                                           Integration. Morgan Kaufmann.
star-shaped sub-queries. ANAPSID introduces adaptive phys-             [3] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill
ical operators to generate results as soon as they arrive from             Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan
the sources. These operators perform better than the traditional           Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Rec. 44, 2 (Aug.
                                                                           2015), 11–16.
blocking operators. MULDER [4] is based on ANAPSID and de-             [4] Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami,
scribes the sources in terms of RDF Molecule Templates (RDF-               Maria-Esther Vidal, and Sören Auer. 2018. Querying Interlinked Data by
MTs).MULDER is able to reduce the query execution time and                 Bridging RDF Molecule Templates. In Transactions on Large-Scale Data-
                                                                           and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science,
increase answer completeness by using semantics in the source              Vol. 11310. Springer, Berlin, Heidelberg, 1–42.
descriptions during decomposition and source selection. Query          [5] Kemele M. Endris, Philipp D. Rohde, Maria-Esther Vidal, and Sören Auer.
                                                                           2019. Ontario: Federated Query Processing against a Semantic Data Lake.
engines for federations of RDF sources can benefit from the se-            In Database and Expert Systems Applications. DEXA2019. Lecture Notes in
mantics of the metadata and received data.                                 Computer Science, Vol. 11706. Springer, Cham, 379–395.
                                                                       [6] Daniela Florescu, Alon Levy, and Alberto Mendelzon. 1998. Database Tech-
   Polystores. More recently, the research focus shifted towards           niques for the World-Wide Web: A Survey. SIGMOD Rec. 27, 3 (Sept. 1998),
                                                                           59–74.
query processing against heterogeneous data sources. Differ-           [7] Behzad Golshan, Alon Y. Halevy, George A. Mihaila, and Wang-Chiew Tan.
ent approaches have been proposed on how to store, integrate,              2017. Data Integration: After the Teenage Years. In Proceedings of the 36th
                                                                           ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems,
and query heterogeneous federations. SeBiDA [18] is a proof-               PODS 2017, Chicago, IL, USA, May 14-19, 2017. 101–106.
of-concept for a semantified big data architecture. Data sets are      [8] Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An Intelli-
differentiated in semantic, annotated with semantics, and non-             gent Data Lake System. In Proceedings of the 2016 International Conference on
                                                                           Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June
semantic. The latter can optionally be lifted with semantics if            26 - July 01, 2016. 2097–2100.
mappings are provided. SeBiDA uses Apache Spark to reformat            [9] Alon Halevy, Anand Rajaraman, and Joann Ordille. 2006. Data Integration:
the data according to classes. The data is reformatted in Apache           The Teenage Years (VLDB ’06). VLDB Endowment, 9–16.
                                                                      [10] Alon Y. Halevy. 2001. Answering Queries Using Views: A Survey. The VLDB
Parquet tables. Therefore, the data is integrated in a centralized         Journal 10, 4 (Dec. 2001), 270–294.
or clustered manner and can be queried using SQL. Contrary            [11] Alon Y. Halevy, Anand Rajaraman, and Joann J. Ordille. 2006. Data Integration:
                                                                           The Teenage Years. In Proceedings of the 32nd International Conference on Very
to SeBiDA, PolyWeb [15] and Ontario [5] keep the data sources              Large Data Bases, Seoul, Korea, September 12-15, 2006. 9–16.
in their original data model. Data sources are queried in their       [12] Ali Hasnain, Qaiser Mehmood, Syeda Sana e Zainab, Muhammad Saleem,
native query language while the user sends SPARQL queries to               Claude Warren, Durre Zehra, Stefan Decker, and Dietrich Rebholz-Schuhmann.
                                                                           2017. BioFed: federated query processing over life sciences linked open data.
the query engine. PolyWeb uses the same cost-based model as                Journal of Biomedical Semantics 8, 1 (15 Mar 2017), 13:1–13:19.
FedX does and predicate-based join groups to reduce the number        [13] Zachary G. Ives, Alon Y. Halevy, Peter Mork, and Igor Tatarinov. 2004. Piazza:
of local joins. Other than PolyWeb, Ontario is based on MULDER             mediation and integration infrastructure for Semantic Web data. J. Web
                                                                           Semant. 1 (2004), 155–175.
and uses the same plan generator extended with heuristics for         [14] Samaneh Jozashoori and Maria-Esther Vidal. 2019. MapSDI: A Scaled-Up
better optimization potential. Ontario also uses the physical oper-        Semantic Data Integration Framework for Knowledge Graph Creation. In On
                                                                           the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019.
ators of ANAPSID. Several query processing engines have been               Lecture Notes in Computer Science, Vol. 11877. Springer, Cham, 58–75.
proposed, but most of them focus on a single data model for query     [15] Yasar Khan, Antoine Zimmermann, Alokkumar Jha, Vijay Gadepally, Mathieu
optimization and therefore miss optimization opportunities.                D’Aquin, and Ratnesh Sahay. 2019. One Size Does Not Fit All: Querying Web
                                                                           Polystores. IEEE Access 7 (01 2019), 9598–9617.
                                                                      [16] Craig A. Knoblock and Pedro A. Szekely. 2015. Exploiting Semantics for Big
5   CONCLUSION                                                             Data Integration. AI Magazine 36, 1 (2015), 25–38.
                                                                      [17] Maurizio Lenzerini. 2002. Data Integration: A Theoretical Perspective. In
In this paper we present two rather simple heuristics that aim at          Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on
                                                                           Principles of Database Systems, June 3-5, Madison, Wisconsin, USA. 233–246.
improving the query performance by considering the physical           [18] Mohamed Nadjib Mami, Simon Scerri, Sören Auer, and Maria-Esther Vidal.
design of the Data Lake compared to state-of-the-art query exe-            2016. Towards Semantification of Big Data Technology. In Big Data Analytics
cution plans. Our heuristics take the a) presence of indexes, and          and Knowledge Discovery. DaWaK 2016. Lecture Notes in Computer Science,
                                                                           Vol. 9829. Springer, Cham, 376–390.
b) network condition into account. Even though the heuristics         [19] Michalis Mountantonakis and Yannis Tzitzikas. 2019. Large-scale Semantic
and their implementation are in an early stage, we can conclude            Integration of Linked Data: A Survey. ACM Comput. Surv. 52, 5, Article 103
that the query performance in a Data Lake can be improved when             (Sept. 2019), 40 pages.
                                                                      [20] Christoph Quix, Rihan Hai, and Ivan Vatov. 2016. GEMMS: A Generic and
considering the characteristics of each data model. In future work,        Extensible Metadata Management System for Data Lakes. In 28th International
we plan to overcome the described limitations, e.g., improving             Conference on Advanced Information Systems Engineering (CAiSE 2016). 129–
                                                                           136.
the quality of the translated SQL queries. Furthermore, we will       [21] Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael
investigate the performance of different implementations of rela-          Schmidt. 2011. FedX: Optimization Techniques for Federated Query Pro-
tional databases in order to gain a deeper understanding of why            cessing on Linked Data. In The Semantic Web – ISWC 2011. ISWC 2011. Lecture
                                                                           Notes in Computer Science, Vol. 7031. Springer, Berlin, Heidelberg, 601–616.
filter expressions seem to perform better at query engine level in    [22] María-Esther Vidal, Edna Ruckhaus, Tomas Lampo, Amadís Martínez, Javier
most cases even though the intermediate results are larger in that         Sierra, and Axel Polleres. 2010. Efficiently Joining Group Patterns in SPARQL
case. Additionally, studying different kinds of query decompo-             Queries. In The Semantic Web: Research and Applications, Lora Aroyo, Grig-
                                                                           oris Antoniou, Eero Hyvönen, Annette ten Teije, Heiner Stuckenschmidt,
sition (e.g., triple-based instead of star-shaped sub-queries) and         Liliana Cabral, and Tania Tudorache (Eds.). Springer Berlin Heidelberg, Berlin,
not normalized tables is part of our plans.                                Heidelberg, 228–242.
                                                                      [23] Gio Wiederhold. 1992. Mediators in the Architecture of Future Information
                                                                           Systems. IEEE Computer 25, 3 (March 1992), 38–49.
ACKNOWLEDGMENTS                                                       [24] Vladimir Zadorozhny, Louiqa Raschid, Maria Esther Vidal, Tolga Urhan, and
                                                                           Laura Bright. 2002. Efficient Evaluation of Queries in a Mediator for Web-
This work has been partially supported by the EU H2020 RIA                 Sources (SIGMOD ’02). Association for Computing Machinery, New York, NY,
funded projects QualiChain (No 822404) and iASiS (No 727658).              USA, 85–96.