A Quantitative Study of Two Matrix Clustering Algorithms Alexander Slesarev Viacheslav Galaktionov1, 2 Nikita Bobrov1, 2 1 1 Saint-Petersburg University, Russia JetBrains Research, JetBrains Research, 2 Saint-Petersburg, Russia Saint-Petersburg State University, Russia 2 Saint-Petersburg State University, Russia alexander.g.slesarev@gmail.com Saint-Petersburg, Russia Saint-Petersburg, Russia viacheslav.galaktionov@gmail.com nikita.v.bobrov@gmail.com George Chernishev1, 2 1 JetBrains Research, 2 Saint-Petersburg State University, Russia Saint-Petersburg, Russia g.chernyshev@spbu.ru Abstract—Matrix clustering is a technique which permutes query all needed attributes are allocated into a single frag- rows and columns of a matrix to form densely packed regions. ment, and this fragment contains no extra attributes. In this It originated in the 70’s and initially was used for various object case, one can roughly estimate that number of rows × grouping problems, such as machine-component grouping. The database community noticed these algorithms and successfully extra attributes lengths bytes can be saved during the data applied them to the vertical partitioning problem. Recently, reading phase in case of a slotted page data layout [2]. there has been a resurgence of interest in these algorithms. However, if there is a query that requires attributes from Nowadays, they are being considered for dynamic (on-line) two or more fragments, then its performance may suffer due to vertical partitioning and tuning of multistores. the record reconstruction costs. Data modification operations In our previous papers we have described our project aimed at studing the applicability of recent matrix clustering algorithms (inserts, deletes, and updates) complicate things further since for the vertical partitioning problem. We have presented our they involve all attributes of a record and thus, all fragments evaluation approach and reported results concerning several of should be modified. The impact of additional disk seeks on a these algorithms. Our idea was to evaluate them directly using hard drive may be so large that it can make the partitioning the PostgreSQL database. Previous studies have found that these scheme impractical. algorithms can be of use if they employ the attribute replication strategy. In this paper, we continue our investigation and consider Due to all these facts, there is still no support of fully- a novel algorithm of this class. Its distinctive feature is that automatic vertical partitioning in industrial database systems. it performs attribute replication during the branch and bound Moreover, unlike the horizontal, vertical partitioning is not search. We compare it with the best one of the earlier algorithms supported in SQL DDL: e.g., in PostgreSQL it is possible using both real and synthetic workloads. to define horizontal fragments using the “PARTITION BY” Our experiments have demonstrated that the novel algorithm produces slightly worse configurations (about 10%), but its run clause for a “CREATE TABLE” statement. times are significantly better and are almost independent of the Nevertheless, there are multiple semi-automatic stand-alone cohesion parameter. tools (“advisors”, see surveys [3], [4]) for this task. All of Index Terms—databases, database tuning, physical design, them recommend beneficial vertical partitioning schemes for a vertical partitioning, experimentation, matrix clustering, frag- specified workload (queries) and let the database administrator mentation. decide whether to implement them or not. The reason for the limited success of these tools (the over- I. I NTRODUCTION whelming majority of them are academic research prototypes Vertical partitioning is a technique used to speed up query and not industrial products) is that finding an optimal solution processing in databases. Its core idea is dividing a table into is an NP-hard problem for many different formulations [5]– fragments which contain only a subset of attributes. In order [7]. Another well-known fact is that the number of different to ensure that the database will not undergo semantic changes, vertical partitioning schemes for a single table is equal to the following rules of vertical partitioning are used [1]: the N th Bell number, where N is the number of attributes completeness, reconstruction, and disjointness. Sometimes the [8]. Nevertheless, due to the interest of both industrial and disjointness rule is relaxed. In this case, it is said that vertical academic communities, the development of such advisors partitioning is performed with attribute replication. continues. The speedup comes from the fact that some queries would In the core of such a system lies an algorithm that traverses have to read less data. Indeed, suppose that for a given the partitioning space and selects a beneficial scheme. There are two classes of algorithms for this task: cost-based and 2) Attribute affinity is calculated for all pairs of at- heuristic. The former employ some kind of a cost-based model tributes and an Attribute Affinity Matrix (AAM) is to evaluate the quality of a given partitioning scheme in terms constructed. of query run times, required space, and other metrics. The 3) A special algorithm for row and column permutation latter proposes some kind of procedure to generate a “good” is applied to the AAM. Afterwards, “dense” regions scheme. Usually, some considerations are presented as to why are extracted and used to define resulting partitions. it is likely to generate a beneficial partitioning scheme, but not Studies employing the matrix clustering approach (and a strict proof. in particular, the ones considered in our paper) permute The heuristic approach was very popular in the 70’s and AUMs, but not AAMs. 80’s, but later was abandoned in favour of the cost-based • Graph approaches [5], [25]–[28]. Similarly to the pre- one. Nowadays, there is a resurgence of interest in heuristic vious type, these approaches start with a workload and approaches due to the appearance of novel application areas: use it to construct an AAM. However, in this case dynamization of vertical partitioning [9]–[12], tuning of mul- the AAM is considered as an adjacency matrix of an tistores [13], big data applications or any other cases featuring undirected weighted graph, where the nodes are attributes limited resources. and the edge weights show the affinity for a given pair In our previous studies [14]–[16] we have described our of attributes. Finally, this graph is used to search for project that aims to study the applicability of several re- special structures which will be used to define resulting cently developed matrix clustering algorithms. Our project is partitions. There are many approaches, e.g. Kruskal-like motivated by the fact that the authors of these algorithms algorithms or cutting the Hamiltonian way. have not evaluated their performance (run times, quality) • Data mining approaches [29]–[31]. In this type of ap- using a DBMS and a workload. To address this, we have proach, association rule mining is used to derive vertical constructed a framework for evaluating such algorithms that fragments. The workload is considered as a transaction uses PostgreSQL. Then we have evaluated a number of these set, and the rules use sets of attributes as items. This algorithms [17]–[19] using the TPC-H benchmark. In this group of vertical partitioning algorithms is relatively new, paper, we continue our research and consider the most recent so existing algorithms for association rule search are algorithm of this type [20]. frequently used. For example, a popular choice is to adapt The rest of this paper is organized as follows. In Section II Apriori [32] or FP-Max algorithms. we provide a short introduction into the subject and describe III. M ATRIX C LUSTERING A LGORITHMS existing types of heuristic approaches. Next, in Section III we introduce matrix clustering algorithms and provide a A. Basics description of the considered algorithm. Section IV describes The general scheme of this approach is as follows [14], [15]: our experimental framework, setup, and the experiments. The • Construct an Attribute Usage Matrix (AUM) from the results of evaluation are discussed in Section V, threats to va- workload. The matrix is defined as follows: lidity of this study are presented in Section VI and Section VIII ( concludes this paper. 1, query i uses attribute j Mij = 0, otherwise II. R ELATED W ORK • Cluster the AUM by permuting its rows and columns to obtain a block diagonal matrix. As it was stated in the Introduction, there are two types • Extract these blocks and use them to define the resulting of approaches to the vertical partitioning problem — cost- partitions. based and heuristic. Since this problem is almost 40 years old, and a lot of results have been accumulated, we will Some approaches do not operate on a 0-1 matrix. Instead, they only describe studies on heuristic algorithms in this section. modify matrix values to account for additional information More extensive surveys that examine cost-based approaches like query frequency, attribute size and so on. Let us consider as well can be found in references [4], [21]. Heuristic vertical an example. Suppose that there are six queries accessing six partitioning algorithms can be classified into the following attributes: major groups [14], [15]: q1: SELECT a FROM T WHERE a > 10; • Attribute affinity and matrix clustering approaches [17]– q2: SELECT b, f FROM T; [19], [22]–[24]. Attribute affinity is a measure which q3: SELECT a, c FROM T WHERE a = c; shows how frequently two attributes are requested to- q4: SELECT a FROM T WHERE a < 10; gether in a given workload. These approaches use it as q5: SELECT e FROM T; follows: q6: SELECT d, e FROM T WHERE d + e > 0; 1) A workload is used to construct an Attribute Usage The next step is the creation of an AUM using this work- Matrix (AUM), a special way to represent which load. The resulting matrix is shown in Figure 1a. After the attributes are used by each query of a workload. application of a matrix clustering algorithm, the reordered a b c d e f a c b f d e q1 1 0 0 0 0 0 q1 1 0 0 0 0 0 q2 0 1 0 0 0 1 q3 1 1 0 0 0 0 q3 1 0 1 0 0 0 q4 1 0 0 0 0 0 q4 1 0 0 0 0 0 q2 0 0 1 1 0 0 q5 0 0 0 0 1 0 q6 0 0 0 0 1 1 q6 0 0 0 1 1 0 q5 0 0 0 0 0 1 (a) AUM (b) Reordered AUM Fig. 1: Matrix clustering algorithm a b c d e f IV. E XPERIMENTS 1 1 1 0 0 0 A. Benchmarking 1 1 1 0 0 0 1 1 1 0 0 0 In our previous works we have developed a special pro- 1 0 0 1 1 0 totype for experimental evaluation of matrix clustering algo- 1 0 0 1 1 0 rithms. The idea of our approach is to directly check whether 1 0 0 0 0 1 the generated partitioning schemes help to improve query performance. For these purposes we employ the PostgreSQL Fig. 2: Non-decomposable matrix DBMS and several workloads, both real and synthetic. The architecture of our prototype is presented in Figure 3. It consists of the following modules: AUM (Figure 1b) is acquired. The resulting fragments are the • The parser reads the workload from a file. It extracts the following: (a, b), (b, f ), (d, e). queries and passes them to the executor, so that their However, not all matrices are fully decomposable. Consider execution times can be measured. It also constructs the the matrix presented in Figure 2. The first column obstructs AUM, which serves as input for the selected algorithm. the perfect decomposition into several clusters. In this case, the • The algorithm identifies clusters and passes that informa- algorithm should produce a decomposition which minimally tion to the partitioner to create corresponding temporary harms query processing and results in an overall performance tables. improvement. Matrix clustering algorithms employ different • The query rewriter also receives this information. It strategies to select such a decomposition. replaces the name of the original table with the ones that A systematic review of matrix clustering algorithms is were generated by the partitioner. presented in studies [14], [15]. Here, we will consider only • The partitioner generates new names and sends partition- the recent approaches. ing commands to the database. The exact commands are SELECT INTO and ALTER TABLE. The latter lets it B. Recent Advances transfer primary keys. Within our project, we study a series of works by Chun- • The executor accepts queries and sends them to Post- Hung Cheng et al [17]–[20]. These algorithms employ a greSQL to measure the time of execution. branch and bound search that tries to find submatrices that conform to specific conditions. Their input is the threshold B. Experimental Setup and Evaluation Procedure (target cohesion), which is defined as the share of 1’s in the In our experiments, we have used the following hardware resulting matrices. and software setup: In this study we are interested in two algorithms — A09 [19] • Inspiron 15 7000 Gaming (0798), 8GiB, Intel(R) and A11 [20]. Core(TM) i5-7300HQ CPU @ 2.50GHz, TOSHIBA 1TB The A09 algorithm comes with three different strategies MQ02ABD1 that define the treatment of intersubmatrix attributes (the ones • Ubuntu 18.10, PostgreSQL 11.1, gcc 8.2.0 that were marked as obstacles to decomposition) — nearest, separate, and replicate. In the first one such attribute goes to Data for quality-related graphs was obtained by running the nearest submatrix, in the second all such attributes are 10 invocations of the respective algorithm and averaging the assigned to a dedicated submatrix, and the last one replicates result. We deemed a single run sufficient for run time graphs, the attribute into each submatrix that requires it. Note that the since even one invocation can require up to two hours. strategy is applied after the clustering is done. In order to ensure maximum quality of experiments, several The A11 algorithm has a different idea. If during branch measures were taken: and bound traversal the algorithm encounters such an attribute, 1) We eliminated data caching for both operating sys- then it replicates it and tries to decompose the matrices further. tem caches and PostgreSQL caches. For this, we vpart workload parser executor rewriter PostgreSQL algorithm partitioner Fig. 3: The architecture of our approach have restarted PostgreSQL and dropped the operat- algorithm, and run times determine its suitability for on-line ing system caches before running each query. Oper- vertical partitioning. ating system caches were dropped by writing “3” to To evaluate the quality of partitioning, we have compared /proc/sys/vm/drop_caches. algorithm A11 to the best of other matrix clustering algorithms 2) Next, we manually checked plans for each query and (according to our previous studies [14], [15]) — A09. This noticed that some queries may have different scan algorithm has three different strategies that were described operator implementations depending on the table. Fre- earlier. In our experiments we compare the quality of resulting quently, a query on a partitioned table did not have partitions of all three of them with the ones obtained by A11. sequential scan, but rather parallel. To handle this, To conduct experiments we have employed the “Star” table we have restricted the query optimizer to use only of the SDSS (Sloan Digital Sky Survey) dataset. The SDSS is a sequential scans by issuing the following command set publicly available astronomical database that contains detailed max_parallel_workers_per_gather to 0;. three-dimensional maps of the Universe. It is frequently used To ensure that no hidden caching or other unaccounted as a testing dataset in various data partitioning studies. We processes happen, we have designed the following simple have used the following pack: SDSS-IV Data Release 14, criterion. Suppose that we have a set of queries that involve 2016. Its “Star” table contains 509 attributes and 492515 only a single table and are essentially scans without complex records. data processing. Initially, we run these queries on the original To obtain representative workloads, we have also used the table and record their run times. Then, for every query we SDSS dataset. In SDSS, it is possible to see what queries users designate a table that will contain all attributes necessary to have issued via a special website1 . Using this website, we have evaluate it. Thus, no joins are needed. At the same time, for selected 8 queries from the workload that address solely this some queries, the tables assigned to them will also contain table. extra attributes. Therefore, some tables may serve more than In our first experiment we have varied the cohesion measure one query. Then we run each query on corresponding table (a ratio of 1 in the resulting matrices) for three strategies of and record its run time. Eventually the following two values A09 and compared it with A11. The results are presented in should be approximately equal: Figure 5a. On this chart, each bar represents the performance P 1) Pqi ∈Queries (size(T )/time(qi )) of an individual algorithm with the corresponding strategy. There also are two horizontal lines: not clustered and pinched 2) qi ∈Queries (size(table(qi ))/time(qi )) not clustered. The first one is the workload run time on In these equations size(T ) is the size of a table in bytes. the original, unmodified table. The second is the workload Functions time(qi ) and table(qi ) return the time it took to run run time on the cleaned up original table, containing only a query qi and a table that corresponds to query qi . 30 attributes that are referenced in the workload. In this In other words, the idea is to check that workload run times experiment we varied the cohesion measure parameter. depend solely on the size of the table. To evaluate algorithm run times we used both SDSS and Having applied all the aforementioned measures, we have synthetic (generated) tests. The results of the SDSS tests are obtained the difference of about 10 − 15% in these values. We presented in Figure 5b. Here, we also vary cohesion for the deemed such a result acceptable and decided to start evaluating same four algorithms. the algorithms. In the synthetic tests, we have tried to study the scalability Finally, we must note that our matrix clustering algorithms of the A11 algorithm in terms of run times. For this, we are parallel [16]. However, in this paper we did not consider have generated a set of random 0-1 matrices with different them and instead employed their sequential versions. probabilities of having 1 in each position (cohesion). Then, we have examined the dependency of the run time on the C. Experiments size of the matrix. The specified threshold was set to 0.9 In our study, we have addressed two applicability aspects of in all experiments. If the threshold is more than the used matrix clustering algorithms: quality of generated partitioning cohesion, then a solution (the original matrix) is found almost schemes and algorithm run times. Both of them are important since quality is the primary characteristic of any partitioning 1 http://skyserver.sdss.org/log/en/traffic/ immediately. We also set a time limit of 2 hours, after reaching estingly, for high cohesion values A11 produces more which the algorithm is stopped. fragments, but does not help to improve performance. We started with square matrices (see Figure 4a), then VI. T HREATS TO VALIDITY separately evaluated the influence of the number of columns (Figure 4b) and the number of rows (Figure 4c) on the We have identified a number of issues that should be kept algorithm run time. In the last two experiments we fixed one in mind while discussing our results: dimension to 20 and increased the other up until the time limit 1) First of all, the policy of database restarts after each was reached. query may be unfair. In real-life scenarios where these Finally, we have looked into the storage requirements of algorithms will hypothetically be used, database caching these algorithms (Figure 6). Here, we show the required disk would be present. However, such scenarios are nearly space for each generated configuration. On top of each bar, impossible to simulate since they require hundreds or an overall number of fragments is shown. We have also thousands of real queries and more important, their divided each bar into parts representing the sizes of resulting frequencies and arrival patterns. fragments. The sizes of original and pinched tables are shown 2) Next, the SDSS dataset is only a single dataset, so by horizontal lines. the results may differ on other datasets. Moreover, it is a scientific dataset used by the astronomy research V. R ESULTS AND D ISCUSSION community and therefore, its queries and data may not be comparable to the industrial ones. Nevertheless, it • All of the algorithms produced partitioning schemes that is popular in the vertical partitioning community (e.g. provide better performance than the original and pinched see [33]–[36]) due to the lack of industrial schema-less tables, regardless of the cohesion value. benchmarks. • The quality of produced solutions heavily depends on the 3) There may be errors in our implementation of these cohesion value. Starting with the cohesion value of 0.8 algorithms. In order to mitigate this threat we have tested results of A11 start to rival the results of the best A09 our implementation on example matrices presented in strategies. However, up to this point, the clear winner is the considered papers and ensured that the resulting par- A09 with replication. titioned matrices are the same. Furthermore, to address • Overall, the best result was produced by a replicating this issue we plan to release the source code on GitHub. variant of A09 (3.358, cohesion=0.55), with a separate 4) Contemporary DBMSes are very complex systems in variant of A09 being the fourth (3.500, cohesion=0.8), which minimal changes to inputs may drastically affect and A11 being the fifth (3.553, cohesion=0.8). performance. Therefore, during experimental evaluation • It is interesting to note that there is some sort of a global performance may change not due to vertical partitioning, minimum at the 0.7 point. Here, the total time over all but due to other events, such as query optimizer selecting algorithms is minimal in the whole cohesion range. a completely different plan. To counter this we have • With the SDSS workload algorithm A11 works almost carefully checked query execution plans to find and ten times faster than A09, regardless of the employed eliminate any inconsistencies. We have also devised a strategy. Note that increasing the target threshold also criterion that allows to detect such inconsistencies in increases run times. For A09, run times increased from simple cases. less than 1 second to almost 140 seconds, while A11 took 5) We have considered a relatively simple workload which 0.06 and 0.119 seconds respectively. involves only a single table. Having to perform extra • The scalability of A11 is not as good as desired. However, joins in addition to the partitioning-induced ones may two points should be taken into account. Firstly, run significantly decrease overall performance and thus, the times depend on the number of referenced attributes desirability of vertical partitioning. However, joins with in the workload, not on the total number. Secondly, in other tables are extremely rarely considered in litera- our scalability experiments we used an extremely large ture [3]: only a handful of studies address them. threshold of the cohesion — 0.9. Finally, the author [20] noted that it is possible to interrupt the algorithm earlier VII. ACKNOWLEDGEMENTS while still obtaining decent results. Therefore, further We would like to thank Anna Smirnova for contributing to studies are needed. the editing and proofreading process. • Increasing the number of attributes impacts run times more than increasing the number of queries. In two hours VIII. C ONCLUSION time it is possible to process either a 20 × 25 matrix or In this paper we have presented a quantitative study of a 205 × 20 one. two recent matrix clustering algorithms. We have studied their • The solutions produced by all algorithms require from output quality, run times, and storage requirements using both 1.5 to 2 times more disk space than the pinched table. synthetic and real datasets. Increasing the target threshold increases the number of Our evaluation has shown that for schema-less data all fragments and the overall required disk space. Inter- algorithms can produce a beneficial configuration, while a (a) A11 run times on a square matrix (b) Dependency of A11 run times on matrix width (c) Dependency of A11 run times on matrix height Fig. 4: Run times of the A11 matrix clustering algorithm, synthetic datasets. (a) Quality of partitioning (b) Algorithm run times Fig. 5: Performance of the A11 and A09 matrix clustering algorithms, SDSS datasets. Fig. 6: Storage requirements. replicating variant of A09 is 10% better than A11. However, [14] V. Galaktionov, G. Chernishev, B. Novikov, and D. Grigoriev, “Matrix A11 is significantly faster and more importantly, less impacted clustering algorithms for vertical partitioning problem: an initial performance study,” in Selected Papers of the XVIII International by the target threshold parameter. Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2016), Ershovo, Moscow Region, Russia, October 11-14, 2016., 2016, pp. 24–31. [Online]. Available: http://ceur- R EFERENCES ws.org/Vol-1752/paper05.pdf [15] V. Galaktionov, G. Chernishev, K. Smirnov, B. Novikov, and D. A. Grigoriev, “A study of several matrix-clustering vertical partitioning [1] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems, algorithms in a disk-based environment,” in Data Analytics and Man- 3rd ed. Springer Publishing Company, Incorporated, 2011. agement in Data Intensive Domains, L. Kalinichenko, S. O. Kuznetsov, [2] R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd ed. and Y. Manolopoulos, Eds. Cham: Springer International Publishing, New York, NY, USA: McGraw-Hill, Inc., 2003. 2017, pp. 163–177. [3] G. Chernishev, “A survey of dbms physical design approaches,” [16] V. Galaktionov, “Parallelization of matrix clustering algorithms SPIIRAS Proceedings, vol. 24, pp. 222–276, 2013. [Online]. Available: (accepted),” in Proceedings of the Sixth International Conference www.mathnet.ru/trspy580 on Informatics Problems (SPISOK 2016), 2016. [Online]. Available: [4] ——, “Vertical Partitioning in Relational DBMS,” 30 4 2015, http://spisok.math.spbu.ru/2016/txt/SPISOK-2016.pdf talk at the Moscow ACM SIGMOD chapter meeting; slides and [17] C.-H. Cheng, “A branch and bound clustering algorithm,” Systems, Man video: http://synthesis.ipi.ac.ru/sigmod/seminar/s20150430.html [Ac- and Cybernetics, IEEE Transactions on, vol. 25, no. 5, pp. 895–898, cessed: 2015 11 09]. 1995. [5] X. Lin, M. Orlowska, and Y. Zhang, “A graph based cluster approach for vertical partitioning in database design,” Data & Knowledge Engi- [18] C. Cheng, “Algorithms for vertical partitioning in database physical neering, vol. 11, no. 2, pp. 151–169, 1993. design,” Omega, vol. 22, no. 3, pp. 291–303, 1994. [6] D. Sacca and G. Wiederhold, “Database partitioning in a cluster of [19] C.-H. Cheng and J. Motwani, “An examination of cluster processors,” ACM Trans. Database Syst., vol. 10, pp. 29–56, 1985. identification-based algorithms for vertical partitions,” Int. J. Bus. Inf. Syst., vol. 4, no. 6, pp. 622–638, 2009. [Online]. Available: [7] P. M. G. Apers, “Data allocation in distributed database systems,” ACM http://dx.doi.org/10.1504/IJBIS.2009.026695 Trans. Database Syst., vol. 13, pp. 263–304, 1988. [Online]. Available: http://doi.acm.org/10.1145/44498.45063 [20] C.-H. Cheng, K.-F. Wong, and K.-H. Woo, “An improved branch- [8] M. Hammer and B. Niamir, “A heuristic approach to attribute and-bound clustering approach for data partitioning,” International partitioning,” in Proceedings of the 1979 ACM SIGMOD international Transactions in Operational Research, vol. 18, no. 2, pp. 231–255, 2011. conference on Management of data, ser. SIGMOD ’79. New [Online]. Available: http://dx.doi.org/10.1111/j.1475-3995.2010.00781.x York, NY, USA: ACM, 1979, pp. 93–101. [Online]. Available: [21] G. Chernishev, “A survey of DBMS physical design approaches http://doi.acm.org/10.1145/582095.582110 (in Russian),” Tr. St. Petersburg Inst. Infor. Avtom. Ross. Akad. [9] A. Jindal and J. Dittrich, “Relax and let the database do the partitioning Nauk SPIIRAN, vol. 24, pp. 222–276, 2013. [Online]. Available: online,” ser. Lecture Notes in Business Information Processing, 2012, www.mathnet.ru/trspy580 vol. 126, pp. 65–80. [22] N. Gorla and W. J. Boe, “Database operating efficiency [10] L. Li and L. Gruenwald, “Self-managing online partitioner for databases in fragmented databases in mainframe, mini, and mi- (smopd): A vertical database partitioning system with a fully automatic cro system environments,” Data & Knowledge Engineering, online approach,” ser. IDEAS ’13, 2013, pp. 168–173. vol. 5, no. 1, pp. 1–19, 1990. [Online]. Available: [11] L. Rodrı́guez and X. Li, “A dynamic vertical partitioning approach for http://www.sciencedirect.com/science/article/pii/0169023X9090030H distributed database system,” in Systems, Man, and Cybernetics (SMC), [23] J. A. Hoffer and D. G. Severance, “The use of cluster analysis in 2011 IEEE International Conference on, 2011, pp. 1853–1858. physical data base design,” ser. VLDB ’75, 1975, pp. 69–86. [Online]. [12] L. Rodrı́guez, X. Li, and P. Mejı́a-Alvarez, “An active system for Available: http://doi.acm.org/10.1145/1282480.1282486 dynamic vertical partitioning of relational databases,” ser. Lecture Notes [24] N. Bobrov, G. Chernishev, and B. Novikov, “Workload-independent in Computer Science. Springer Berlin Heidelberg, 2011, vol. 7095, pp. data-driven verticalpartitioning,” in New Trends in Databases and In- 273–284. formation Systems, M. Kirikova, K. Nørvåg, G. A. Papadopoulos, [13] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura, J. Gamper, R. Wrembel, J. Darmont, and S. Rizzi, Eds. Cham: Springer N. Polyzotis, and M. J. Carey, “Miso: Souping up big data query International Publishing, 2017, pp. 275–284. processing with a multistore system,” ser. SIGMOD ’14, 2014, pp. 1591– [25] C.-H. Cheng, W.-K. Lee, and K.-F. Wong, “A genetic algorithm- 1602. [Online]. Available: http://doi.acm.org/10.1145/2588555.2588568 based clustering approach for database partitioning,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, in large databases,” ser. VLDB ’94, 1994, pp. 487–499. vol. 32, no. 3, pp. 215–230, 2002. [33] S. Papadomanolakis and A. Ailamaki, “AutoPart: automating schema [26] S. B. Navathe and M. Ra, “Vertical partitioning for database design: a design for large scientific databases using data partitioning,” in Scientific graphical algorithm,” ser. SIGMOD ’89, 1989, pp. 440–450. and Statistical Database Management, 2004. Proceedings. 16th Inter- [27] J. Du, K. Barker, and R. Alhajj, “Attraction — a global affinity measure national Conference on, june 2004, pp. 383–392. for database vertical partitioning,” in ICWI. IADIS, 2003, pp. 538–548. [34] M. G. Ivanova, M. L. Kersten, N. J. Nes, and R. A. [28] J. H. Son and M.-H. Kim, “α-partitioning algorithm: Vertical partition- Gonçalves, “An architecture for recycling intermediates in a ing based on the fuzzy graph,” ser. DEXA ’01, 2001, pp. 537–546. column-store,” in Proceedings of the 35th SIGMOD international [Online]. Available: http://dl.acm.org/citation.cfm?id=648314.755837 conference on Management of data, ser. SIGMOD ’09. New [29] M. Bouakkaz, Y. Ouinten, and B. Ziani, “Vertical fragmentation of data York, NY, USA: ACM, 2009, pp. 309–320. [Online]. Available: warehouses using the FP-Max algorithm,” in Innovations in Information http://doi.acm.org/10.1145/1559845.1559879 Technology (IIT), 2012 International Conference on, march 2012, pp. [35] T. Malik, X. Wang, D. Dash, A. Chaudhary, A. Ailamaki, and R. Burns, 273–276. “Adaptive physical design for curated archives,” in Proceedings of the [30] N. Gorla and B. P. W. Yan, “Vertical fragmentation in databases using 21st International Conference on Scientific and Statistical Database data-mining technique,” in Database Technologies: Concepts, Method- Management, ser. SSDBM 2009. Berlin, Heidelberg: Springer-Verlag, ologies, Tools, and Applications, J. Erickson, Ed. IGI Global, 2009, 2009, pp. 148–166. [Online]. Available: http://dx.doi.org/10.1007/978- pp. 2543–2563. 3-642-02279-1 11 [31] L. Rodriguez and X. Li, “A support-based vertical partitioning method [36] I. Alagiannis, S. Idreos, and A. Ailamaki, “H2o: A hands-free adaptive for database design,” in Electrical Engineering Computing Science and store,” in Proceedings of the 2014 ACM SIGMOD International Automatic Control (CCE), 2011 8th International Conference on, oct. Conference on Management of Data, ser. SIGMOD ’14. New 2011, pp. 1–6. York, NY, USA: ACM, 2014, pp. 1103–1114. [Online]. Available: [32] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules http://doi.acm.org/10.1145/2588555.2610502