Exploiting Order Dependencies on Primary Keys for Optimization Michał Chromiak? , Piotr Wiśniewski? , Krzysztof Stencel ?? Institute of Informatics Maria Curie Skłodowska University Lublin, Poland? Faculty of Mathematics and Computer Science Nicolaus Copernicus University Toruń, Poland? Institute of Informatics University of Warsaw Warsaw, Poland?? Abstract. Functional dependencies have been used in query optimisa- tion for decades. Moreover, if two domains have a natural ordering of their elements, a functional dependency of them can potentially preserve these orderings, i.e. be a monotonic function. This monotonicity can be exploited by query optimizers. Recently, such monotonic functional de- pendencies have been termed order dependencies. In this paper we propose a query rewriting method based on order depen- dencies on primary keys. If an attribute used in the WHERE clause has an order dependency on the primary key, such a selection can be replaced by the corresponding condition on the primary key. We have implemented this optimisation method in the integration framework called the cuboid. It automates the integration of disparate databases according to the CQS model. Cuboids facilitate injecting dependencies and utilizing them in query optimization. 1 Introduction Optimisation methods that exploit functional dependencies have already been know for decades. Most of these methods base on the injectivity of the de- pendency function or its lack. Other properties of this functions are disused. However, if the domain of such a function is ordered and the function itself can preserve this order (i.e. be monotonic). This possibility have been discovered by Jarek Gryz [1–4]. He noted that date columns usually monotonic functions of artificial primary keys. The article [1] proposes a simple method based on this observation. The resulting speedup of query execution ranged from 20% to 50%. ? email: mchromiak@umcs.pl ? email: pikonrad@mat.uni.torun.pl ?? email: stencel@mimuw.edu.pl 2 Following papers [2–4] abstract so-called order dependencies and present their proof theory similar to Armstrong’s axioms. Optimisation methods invented by Jarek Gryz and his team are implemented inside IBM DB2 in one of its experimental branches. It allows hoping that soon these methods will be available to the user community. However, users of other DBMSs cannot access them. Therefore, in this paper we struggle to implement similar optimisation mechanisms outside of a specific database system. Our ex- perience [5–9] proves that a middleware is a perfect place to do this. It allows e.g. avoiding a dependency of a particular database vendor. In this paper, we use a cuboid as such a middleware. [10, 11] The paper is organized as follows. Section section 2 presents a motivating example that shows optimisation potential values vested in order dependencies. Section section 3 describes the proposed query rewriting algorithm and justifies its semantic correctness. Section section 4 reminds the properties of the cuboid and portrays its potential usage to inject optimisations routines. Section section 5 shows experimental evaluation of the proposed optimisation method. Section section 6 concludes. Acknowledgment We would like to thank Jarek Gryz for his inspiring keynote talk on Order Dependencies at the conference BDAS 2014 in Ustroń. Contribution The contribution of this paper consists of (1) a query rewriting algorithm that employs order dependencies and (2) an idea to use the cuboid as a layer to inject optimisation algorithms and (3) its positive verification. Fig. 1. An example schema of a sales database 1 customers facts 1 products c_id f_id p_id cname date pname n caddress c_id n p_id aggrv aggrq 2 Motivating Example Assume a simple data warehouse of the schema presented on Figure fig. 1. Con- sider the query for the sales in a selected period show on Listing 1.1. If the column date has no index, this query will require a full scan of the fact table. 3 Listing 1.1. A query for sales in the indicated period SELECT cname , sum( aggrv ) , SUM( aggrq ) 1 FROM f a c t s JOIN c u s t o m e r s USING ( c i d ) 2 WHERE date between ’ 2008−12−13 ’ AND ’ 2008−12−15 ’ 3 GROUP BY c i d , cname 4 The column f id of the table facts is its primary key. Therefore, all other columns of this tables are functionally dependent on it. The column date is this a function d : INT 7→ DATES . The implementation of such an artificial key is usually based on a sequence generator. Moreover, it is easy to assure that facts on sales from a particular day are recorded after all sales from the previous day. Therefore, we can assume that that d : INT 7→ DATES is non-decreasing. If we assume that: xmin = min{x; d(x) = ’2008-12-13’} xmax = max {x; d(x) = ’2008-12-15’} the query from Listing listing 1.1 is equivalent to the query shown on Listing listing 1.2. We have rewritten the condition of the WHERE clause. Listing 1.2. A rewritten query for sales in the indicated period SELECT cname , sum( aggrv ) , SUM( aggrq ) 1 FROM f a c t s JOIN c u s t o m e r s USING ( c i d ) 2 WHERE f i d between xmin AND xmax 3 GROUP BY c i d , cname 4 The query of Listing 1.2 will be executed using the range index on the primary key provided it is enough selective. Thus, its running time will be significantly shorter than that of the query from Listing listing 1.1. Moreover, the monotonic- ity of the function d allows efficient computing of xmin and xmax using the binary search. The experiments we have conducted prove that the overhead caused by this binary search is notably smaller that the time saved by executing the opti- mized version of the example query. This observations have led us to a rewrite algorithms that implements the idea presented above. 3 Query rewriting algorithm 3.1 Order Dependencies Assume a table T with its primary key P and remaining attributes {A1 , . . . , An }. Since P is the primary key of T there exists functions f1 , . . . , fn such that each tuple (p, a1 , ..., an ) of the table T cane be expressed as (p, f1 (p), . . . , fn (p)). The existence of the functions f1 , . . . , fn validate the functional dependencies of the column A1 , . . . , An on the primary key P . Assume that the domains of the columns P and Ai for a given i ∈ {1, 2, . . . , n} are linearly ordered sets. The functional dependency between P and Ai will be called an order Dependencies, if the function fi is monotonic. Moreover, an 4 important property of fi is whether is increasing, non-decreasing, non-increasing or decreasing. Such dependencies are initially called monotonic dependencies [1]. Then, their inventors coin the name order dependencies. The motivating example from Section 2 is based on such a dependency between the primary key f id and the column date. 3.2 Query rewriting The goal of the algorithm is to replace range conditions on non-indexed columns to corresponding range search on usually indexed primary key. The algorithm is aware of the schema, functional and order dependencies. Its version presented is this paper is able to rewrite SPJ queries (select-project-joins) and grouping-and- aggregate queries. The input of the algorithm is a query of the form portrayed on Listing1.3: Listing 1.3. Initial form of the query to be possibly rewritten SELECT . . . 1 FROM T JOIN T1 ON (T . f 1 k = T1 . pk ) 2 JOIN T2 ON (T . f 2 k = T2 . pk ) 3 .... 4 WHERE T . Ai BETWEEN a1 AND a2 5 GROUP BY . . . . . . 6 We also allow WHERE clauses with equality and inequality. We than convert them to atomic formulae based on BETWEEN using the same value or the data type margin (“infinity”) values. The condition WHERE T.Ai = a is the converted to WHERE T.Ai a BETWEEN a. In the first step we identify the fact table. We analyze the conditions in the JOIN ... ON clauses. The fact table connects other tables by foreign keys while its primary key is not connected by any other foreign key. In a query of the form shown on Listing 1.3 the fact table is denoted by T . If a query contains a string of dependencies foreign-primary key (e.g. in a snowflake schema), the algorithm will also do. However, if it encounters a cycle, it will stop processing and return the original query. The second step of the algorithm consists in checking whether (1) the WHERE clause references a column of the identified fact table and (2) this column has an order dependency of the primary key. In the third step, if the function fi is non-decreasing, we will search for values pmin and pmax such that: pmin = min{p; fi (p) = a1}, pmax = max {p; fi (p) = a2} (1) Analogously, if this function in non-increasing, the algorithm will compute values pmin and pmax such that: pmin = min{p; fi (p) = a2}, pmax = max {p; fi (p) = a1} (2) 5 Eventually, the algorithm concludes replacing the WHERE with: WHERE T.P BETWEEN pmin AND pmax Since the function fi is monotonic, the computation of pmin and pmax can be computed efficiently, e.g. using the binary search. 3.3 Remarks on the Implementation The algorithm is implemented in a middleware, i.e. outside of the database system. The computation of pmin and pmax that satisfy conditions (1) and (2) can be done in at least two ways. Both are based on the binary search. Firstly, we can send a series of queries in the course of the binary search. Its advantage is its inherent simplicity and the lack of any additional database object required. However, it causes numerous communication round trips with the database systems. Secondly, we can install appropriate stored procedures on the database side. We have decided to implement the binary search exactly this way. When the op- timizer on the middleware side is informed on the order dependency between the primary key and the column date, it will generate and install two stored func- tions. One of them shown on Listing 1.4 finds minimal f id for a given date. An analogous function get max fid by date(DATE) that computes maximal f id is also needed. Listing 1.4. A function that finds the minimal f id for a given date CREATE OR REPLACE FUNCTION g e t m i n f i n d b y d a t e ( 1 DF DATE 2 ) RETURNS integer AS $$ 3 DECLARE 4 F INTEGER; 5 Z INTEGER; 6 S INTEGER; 7 D DATE; 8 BEGIN 9 SELECT MAX ( f i d ) INTO Z FROM f a c t s ; 10 S=1; 11 WHILE S1 LOOP 16 S=S / 2 ; 17 SELECT date into D from f a c t s where f i d = F − S ; 18 IF D>= DF THEN 19 F=F − S ; 20 END IF ; 21 6 22 END LOOP; 23 RETURN F ; 24 END; 25 $$ LANGUAGE p l p g s q l 26 For the sake of readability we removed error handling code from the function get min... These errors may be caused by gaps in the numbering stored in the column f id. Using this function (and its twin get max...) the optimizer will first issue queries for corresponding margin values of f id. Then, it will put the collected parameters as values of bind variables in the modified query. 4 Cuboid as the linkup data structure Cuboid is a form of central, master metadata repository. It is responsible for storing contributory meta information about constituent data sources. Each in- tegrated data source must first be a subject of a registration procedure (see fig. 2). To register a data source at the Cuboid each data source needs a dedi- Fig. 2. General architecture for data access 5. iDAO 6. REST Call JSON 9. RS Cuboid 4. Con tribu ADAPTER View tory 3. 8. RS Contributory View 7. SQL MEDIATOR 1. 2. FACTS META DATA cated mediator to extract from the data source its most informative metadata about schemas, entities and their detailed description. Among those metadata mediator also places adequate queries. Those queries are native queries consid- ering particular data source. Each query is responsible for storing information 7 about particular part of the schema. The decision about which part of the data is going to be covered with the queries’ result sets is for the first time made at mediator configuration, prior to its registration in Cuboid. Thus, during medi- ator initialization (ie. metadata collecting from data source) mediator is aware of the data sets that needs to be covered with queries. The metadata collected from data source can further be modified during the mediator to Cuboid chatter. At the Cuboid, the metadata, in form of a contributory view provided during mediator registration, is stored and used for building of the global view. The global view is configured and build by designer at the Cubiod site. Prepared global view is then made available to the Adapter instance in form of interopera- ble Data Access Objects (iDAO). Those objects are designed to cover each data source contact details and its requested metadata with native queries. Each time client requests from Adapter one of global views1 , Adapter reaches for requested global view from Cuboid and unmarshalls out of it native queries, together with contact details to the data source of their origin. Using the con- tact details Adapter sends native queries to original data sources and receives requested result sets. This is happening at the lowest level - ie. JDBC. Now the result sets are being composed together based on global view. When the global view is ready in materialized form, it shall be made available to the client in unified way - ie. in form of REST API. The entire process involves complex metamodel for collecting and transforming contributory and global metadata. This information has been discussed with de- tails and examples in [11, 10]. The discussed optimization method in form of Order Dependencies (OD) can be easily applied with use of Cuboid architecture. Without need to interfere with database optimization engines we will be able to rewrite queries stored at the site of Cuboid. 4.1 Unified data access interface As depict in fig. 2, client calls for the optimized resource are supposed to be commenced using REST API. The construct of the client REST API includes non optimized and optimized version of the query. While requesting client will use the REST API to define whether the query response ie. result set, is going to be processed using non optimized version of the query or optimized. We have prepared four implementations for data access layers. Two of them - using JdbcTemplate and SimpleJdbcCall- are Spring Framework based and third is pure JDBC. The final method gets the pmin and pmax hard coded. This is to compare the time of pmin and pmax retrieval and overhead that is brought by each of the three remaining methods. The REST API is designed as follows. To retrieve unoptimized query answer the request URL should look like this: http://localhost:8080/DIAS/rest/dbs/facts/2008-01-01:2008-01-02 1 The Cuboid can store many arbitrarily customized global views depending on de- signer requirements. 8 Now, to request for an optimized query depending on query commuting mecha- nism the URL would change to: http://localhost:8080/DIAS/rest/dbs/facts/2008-01-01:2008-01-02/opti/X Where X stands for the query commit method number. The X values has been assigned as follows: 1. Spring simpleJdbcCall (stored functions) 2. Spring JDBCTemplate call statement (stored functions) 3. pure JDBC connection (stored functions) 4. Spring JdbcTemplate with (sub-queries rewrite) 5. hard coded pmin, pmax values This way we can get the necessary optimization method in simple and straight- forward manner. We will present the results in following section. 5 Experiments Let us first describe the testing environment. The tests has been performed using the following hardware: CPU Intel Core i7-3612QM CPU @ 2.10 GHz x 8 RAM 15,6 GiB Disk SAMSUNG SSD PM830 2.5” 7mm 512GB OS Ubuntu 14.04 LTS Kernel 3.13.0-30-generic Arch. x86 64 GNU/Linux Table 1: Hardware configuration used for tests. The procedure was to measure response times for the REST client calls to optimized and unoptimized queries. This means a testing REST client called REST API that has used underneath optimized or non-optimized query for result set retrieval. The tests has been performed using a the following software: Java java version 1.7 60 Java(TM) SE Runtime Environment (build 1.7.0 60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode) REST Testing Client ApacheBench, Version 2.3 Http Server Apache Tomcat/6.0.29 Table 2: Software used in testing process. The test cases assumed two optimization methods. One was to rewrite query with substitution of WHERE clause with two stored function results. For com- parison reasons, the second case (fifth method) assumed replacing the stored 9 functions with simple sub queries to achieve the same goal as in the firs case. Namely: Listing 1.5. Simple rewrite with sub-queries SELECT f i d , sum( aggrv ) , 1 FROM f a c t s 2 WHERE f d a t e BETWEEN ( s e l e c t min( f i d ) from f a c t s 3 where f d a t e >= x ) 4 AND ( s e l e c t max( f i d ) from f a c t s 5 where f d a t e <= y ) 6 GROUP BY c i d ; 7 All tested use cases were conducted against the same request parameters and source data. The queried data range was between 2008-01-01 and 2008-01-02. The result size was 31,546 MB. Each method has been tested 50 times. Measuring database response times for pmin and pmax was based on Java’s currentTimeMilis() method from java.lang.System 2 . The test results has been placed in table 3. Activity Call Method Document 31.546 Length [MB] Method Name simpleJdbcCall JdbcTemplate JDBC Subquery Hard Coded non-opti Stored Functions / pmin pmax pmin pmax pmin pmax pmin pmax 0 Subqueries [ms] 13 14 7 6 5 6 7524 15493 Avg.Time per request 44.010 29.762 27.196 23038.530 16.431 87681.945 [ms] Table 3: Test results for 50 request trial. The results has clearly shown that rewriting the WHERE clause boosts the target query almost 4 times. This is while only modifying the WHERE clause with subqueries enabling primary key in role of index. This gives us the idea of how order dependency based query can be effective. Both of the queries do operate on f date column that has not even been indexed. The result would be greatly better if only we would place an index on f date column. The hard coded column values for pmin and pmax are presented to compare the time performance of the query itself without rewriting process. Three remaining JDBC-based use cases for (pmin,pmax) retrieval, are at worst three times slower than the hard coded (pmin,pmax) pair. 2 The detailed discussion for choosing this method has been conducted in [12] 10 In general we have gained a speed boost form 87.681 seconds to only 0.027 seconds, which reduced the time of result retrieval for approx. 99,96%. Such gain for the discussed use case, is achieved with best - pure JDBC - method, comparing to non optimized query. 6 Conclusions In this paper we have analyzed so called order dependencies and their optimi- sation potential when applied at the middleware level. An order dependency is a functional dependency such that its induced function is monotonic with re- spect to linear orderings of domains. We have proposed an optimisation method that exploits order dependencies on primary keys. We have prepared its proof- of-concept implementation on the middleware level (the cuboid). Middleware has amounted to a feasible place for such optimisations and made such a so- lution vendor neutral. We have also performed experimental evaluation of this implementation and got promising results. References 1. Szlichta, J., Godfrey, P., Gryz, J., Ma, W., Pawluk, P., Zuzarte, C.: Queries on dates: fast yet not blind. In Ailamaki, A., Amer-Yahia, S., Patel, J.M., Risch, T., Senellart, P., Stoyanovich, J., eds.: EDBT, ACM (2011) 497–502 2. Szlichta, J., Godfrey, P., Gryz, J.: Fundamentals of order dependencies. PVLDB 5 (2012) 1220–1231 3. Szlichta, J., Godfrey, P., Gryz, J., Zuzarte, C.: Expressiveness and complexity of order dependencies. PVLDB 6 (2013) 1858–1869 4. Szlichta, J., Godfrey, P., Gryz, J., Ma, W., Qiu, W., Zuzarte, C.: Business- intelligence queries with order dependencies in db2. In Amer-Yahia, S., Christophides, V., Kementsietsidis, A., Garofalakis, M.N., Idreos, S., Leroy, V., eds.: EDBT, OpenProceedings.org (2014) 750–761 5. Gawarkiewicz, M., Wiśniewski, P.: Partial aggregation using Hibernate. In Kim, T.H., Adeli, H., Slezak, D., Sandnes, F.E., Song, X., Chung, K.I., Arnett, K.P., eds.: FGIT. Volume 7105 of Lecture Notes in Computer Science., Springer (2011) 90–99 6. Wiśniewski, P., Szumowska, A., Burzańska, M., Boniewicz, A.: Hibernate the recursive queries - defining the recursive queries using Hibernate ORM. In Eder, J., Bieliková, M., Tjoa, A.M., eds.: ADBIS (2). Volume 789 of CEUR Workshop Proceedings., CEUR-WS.org (2011) 190–199 7. Gawarkiewicz, M., Wiśniewski, P., Stencel, K.: Enhanced segment trees in object- relational mapping. In: Proceedings of the 6th Balkan Conference in Informatics. BCI ’13, New York, NY, USA, ACM (2013) 122–128 8. Wiśniewski, P., Stencel, K.: Query rewriting based on meta-granular aggregation. In Szczuka, M., Czaja, L., Kacprzak, M., eds.: Proceedings of the 22nd Interna- tional Workshop on Concurrency, Specification and Programming (CS&P 2013), Białystok, Białystok University of Technology (2013) 457–468 11 9. Boniewicz, A., Wisniewski, P., Stencel, K.: On materializing paths for faster recur- sive querying. In Catania, B., Cerquitelli, T., Chiusano, S., Guerrini, G., Kämpf, M., Kemper, A., Novikov, B., Palpanas, T., Pokorný, J., Vakali, A., eds.: AD- BIS (2). Volume 241 of Advances in Intelligent Systems and Computing., Springer (2013) 105–112 10. Chromiak, M., Stencel, K.: The linkup data structure for heterogeneous data integration platform. In Kim, T.H., Lee, Y.H., Fang, W.C., eds.: FGIT. Volume 7709 of Lecture Notes in Computer Science., Springer (2012) 263–274 11. Chromiak, M., Stencel, K.: A data model for heterogeneous data integration ar- chitecture. In Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., eds.: BDAS. Volume 424 of Communications in Computer and In- formation Science., Springer (2014) 547–556 12. Chromiak, M., Lojewski, Z.: Stream security particularities in java. Ann. UMCS, Inf. 8 (2008) 5–13