A Quantitative Study of Two Matrix Clustering
                        Algorithms
         Alexander Slesarev                       Viacheslav Galaktionov1, 2                           Nikita Bobrov1, 2
                                                     1                                             1
 Saint-Petersburg University, Russia             JetBrains Research,                          JetBrains Research,
                                     2
      Saint-Petersburg, Russia         Saint-Petersburg State University, Russia 2 Saint-Petersburg State University, Russia
  alexander.g.slesarev@gmail.com              Saint-Petersburg, Russia                    Saint-Petersburg, Russia
                                         viacheslav.galaktionov@gmail.com                nikita.v.bobrov@gmail.com

                                                    George Chernishev1, 2
                                                     1
                                                       JetBrains Research,
                                         2
                                             Saint-Petersburg State University, Russia
                                                    Saint-Petersburg, Russia
                                                     g.chernyshev@spbu.ru


   Abstract—Matrix clustering is a technique which permutes            query all needed attributes are allocated into a single frag-
rows and columns of a matrix to form densely packed regions.           ment, and this fragment contains no extra attributes. In this
It originated in the 70’s and initially was used for various object    case, one can roughly estimate that number of rows ×
grouping problems, such as machine-component grouping. The
database community noticed these algorithms and successfully           extra attributes lengths bytes can be saved during the data
applied them to the vertical partitioning problem. Recently,           reading phase in case of a slotted page data layout [2].
there has been a resurgence of interest in these algorithms.              However, if there is a query that requires attributes from
Nowadays, they are being considered for dynamic (on-line)              two or more fragments, then its performance may suffer due to
vertical partitioning and tuning of multistores.                       the record reconstruction costs. Data modification operations
   In our previous papers we have described our project aimed at
studing the applicability of recent matrix clustering algorithms       (inserts, deletes, and updates) complicate things further since
for the vertical partitioning problem. We have presented our           they involve all attributes of a record and thus, all fragments
evaluation approach and reported results concerning several of         should be modified. The impact of additional disk seeks on a
these algorithms. Our idea was to evaluate them directly using         hard drive may be so large that it can make the partitioning
the PostgreSQL database. Previous studies have found that these        scheme impractical.
algorithms can be of use if they employ the attribute replication
strategy. In this paper, we continue our investigation and consider       Due to all these facts, there is still no support of fully-
a novel algorithm of this class. Its distinctive feature is that       automatic vertical partitioning in industrial database systems.
it performs attribute replication during the branch and bound          Moreover, unlike the horizontal, vertical partitioning is not
search. We compare it with the best one of the earlier algorithms      supported in SQL DDL: e.g., in PostgreSQL it is possible
using both real and synthetic workloads.                               to define horizontal fragments using the “PARTITION BY”
   Our experiments have demonstrated that the novel algorithm
produces slightly worse configurations (about 10%), but its run
                                                                       clause for a “CREATE TABLE” statement.
times are significantly better and are almost independent of the          Nevertheless, there are multiple semi-automatic stand-alone
cohesion parameter.                                                    tools (“advisors”, see surveys [3], [4]) for this task. All of
   Index Terms—databases, database tuning, physical design,            them recommend beneficial vertical partitioning schemes for a
vertical partitioning, experimentation, matrix clustering, frag-       specified workload (queries) and let the database administrator
mentation.
                                                                       decide whether to implement them or not.
                                                                          The reason for the limited success of these tools (the over-
                       I. I NTRODUCTION
                                                                       whelming majority of them are academic research prototypes
   Vertical partitioning is a technique used to speed up query         and not industrial products) is that finding an optimal solution
processing in databases. Its core idea is dividing a table into        is an NP-hard problem for many different formulations [5]–
fragments which contain only a subset of attributes. In order          [7]. Another well-known fact is that the number of different
to ensure that the database will not undergo semantic changes,         vertical partitioning schemes for a single table is equal to
the following rules of vertical partitioning are used [1]:             the N th Bell number, where N is the number of attributes
completeness, reconstruction, and disjointness. Sometimes the          [8]. Nevertheless, due to the interest of both industrial and
disjointness rule is relaxed. In this case, it is said that vertical   academic communities, the development of such advisors
partitioning is performed with attribute replication.                  continues.
   The speedup comes from the fact that some queries would                In the core of such a system lies an algorithm that traverses
have to read less data. Indeed, suppose that for a given               the partitioning space and selects a beneficial scheme. There
are two classes of algorithms for this task: cost-based and                2) Attribute affinity is calculated for all pairs of at-
heuristic. The former employ some kind of a cost-based model                   tributes and an Attribute Affinity Matrix (AAM) is
to evaluate the quality of a given partitioning scheme in terms                constructed.
of query run times, required space, and other metrics. The                 3) A special algorithm for row and column permutation
latter proposes some kind of procedure to generate a “good”                    is applied to the AAM. Afterwards, “dense” regions
scheme. Usually, some considerations are presented as to why                   are extracted and used to define resulting partitions.
it is likely to generate a beneficial partitioning scheme, but not       Studies employing the matrix clustering approach (and
a strict proof.                                                          in particular, the ones considered in our paper) permute
    The heuristic approach was very popular in the 70’s and              AUMs, but not AAMs.
80’s, but later was abandoned in favour of the cost-based              • Graph approaches [5], [25]–[28]. Similarly to the pre-
one. Nowadays, there is a resurgence of interest in heuristic            vious type, these approaches start with a workload and
approaches due to the appearance of novel application areas:             use it to construct an AAM. However, in this case
dynamization of vertical partitioning [9]–[12], tuning of mul-           the AAM is considered as an adjacency matrix of an
tistores [13], big data applications or any other cases featuring        undirected weighted graph, where the nodes are attributes
limited resources.                                                       and the edge weights show the affinity for a given pair
    In our previous studies [14]–[16] we have described our              of attributes. Finally, this graph is used to search for
project that aims to study the applicability of several re-              special structures which will be used to define resulting
cently developed matrix clustering algorithms. Our project is            partitions. There are many approaches, e.g. Kruskal-like
motivated by the fact that the authors of these algorithms               algorithms or cutting the Hamiltonian way.
have not evaluated their performance (run times, quality)              • Data mining approaches [29]–[31]. In this type of ap-
using a DBMS and a workload. To address this, we have                    proach, association rule mining is used to derive vertical
constructed a framework for evaluating such algorithms that              fragments. The workload is considered as a transaction
uses PostgreSQL. Then we have evaluated a number of these                set, and the rules use sets of attributes as items. This
algorithms [17]–[19] using the TPC-H benchmark. In this                  group of vertical partitioning algorithms is relatively new,
paper, we continue our research and consider the most recent             so existing algorithms for association rule search are
algorithm of this type [20].                                             frequently used. For example, a popular choice is to adapt
    The rest of this paper is organized as follows. In Section II        Apriori [32] or FP-Max algorithms.
we provide a short introduction into the subject and describe
                                                                              III. M ATRIX C LUSTERING A LGORITHMS
existing types of heuristic approaches. Next, in Section III
we introduce matrix clustering algorithms and provide a              A. Basics
description of the considered algorithm. Section IV describes          The general scheme of this approach is as follows [14], [15]:
our experimental framework, setup, and the experiments. The            • Construct an Attribute Usage Matrix (AUM) from the
results of evaluation are discussed in Section V, threats to va-         workload. The matrix is defined as follows:
lidity of this study are presented in Section VI and Section VIII                        (
concludes this paper.                                                                      1, query i uses attribute j
                                                                                   Mij =
                                                                                           0, otherwise
                     II. R ELATED W ORK                                •   Cluster the AUM by permuting its rows and columns to
                                                                           obtain a block diagonal matrix.
   As it was stated in the Introduction, there are two types
                                                                        • Extract these blocks and use them to define the resulting
of approaches to the vertical partitioning problem — cost-
                                                                           partitions.
based and heuristic. Since this problem is almost 40 years
old, and a lot of results have been accumulated, we will             Some approaches do not operate on a 0-1 matrix. Instead, they
only describe studies on heuristic algorithms in this section.       modify matrix values to account for additional information
More extensive surveys that examine cost-based approaches            like query frequency, attribute size and so on. Let us consider
as well can be found in references [4], [21]. Heuristic vertical     an example. Suppose that there are six queries accessing six
partitioning algorithms can be classified into the following         attributes:
major groups [14], [15]:                                             q1: SELECT a FROM T WHERE a > 10;
  •   Attribute affinity and matrix clustering approaches [17]–      q2: SELECT b, f FROM T;
      [19], [22]–[24]. Attribute affinity is a measure which         q3: SELECT a, c FROM T WHERE a = c;
      shows how frequently two attributes are requested to-          q4: SELECT a FROM T WHERE a < 10;
      gether in a given workload. These approaches use it as         q5: SELECT e FROM T;
      follows:                                                       q6: SELECT d, e FROM T WHERE d + e > 0;
       1) A workload is used to construct an Attribute Usage           The next step is the creation of an AUM using this work-
          Matrix (AUM), a special way to represent which             load. The resulting matrix is shown in Figure 1a. After the
          attributes are used by each query of a workload.           application of a matrix clustering algorithm, the reordered
                         a   b    c    d   e   f                                              a   c   b   f   d   e
                   q1    1   0    0    0   0   0                                         q1   1   0   0   0   0   0
                   q2    0   1    0    0   0   1                                         q3   1   1   0   0   0   0
                   q3    1   0    1    0   0   0                                         q4   1   0   0   0   0   0
                   q4    1   0    0    0   0   0                                         q2   0   0   1   1   0   0
                   q5    0   0    0    0   1   0                                         q6   0   0   0   0   1   1
                   q6    0   0    0    1   1   0                                         q5   0   0   0   0   0   1
                             (a) AUM                                                          (b) Reordered AUM
                                                   Fig. 1: Matrix clustering algorithm


                     a   b    c   d    e   f                                                  IV. E XPERIMENTS
                     1   1    1   0    0   0                         A. Benchmarking
                     1   1    1   0    0   0
                     1   1    1   0    0   0                            In our previous works we have developed a special pro-
                     1   0    0   1    1   0                         totype for experimental evaluation of matrix clustering algo-
                     1   0    0   1    1   0                         rithms. The idea of our approach is to directly check whether
                     1   0    0   0    0   1                         the generated partitioning schemes help to improve query
                                                                     performance. For these purposes we employ the PostgreSQL
              Fig. 2: Non-decomposable matrix                        DBMS and several workloads, both real and synthetic.
                                                                        The architecture of our prototype is presented in Figure 3.
                                                                     It consists of the following modules:
AUM (Figure 1b) is acquired. The resulting fragments are the
                                                                        • The parser reads the workload from a file. It extracts the
following: (a, b), (b, f ), (d, e).
                                                                          queries and passes them to the executor, so that their
   However, not all matrices are fully decomposable. Consider
                                                                          execution times can be measured. It also constructs the
the matrix presented in Figure 2. The first column obstructs
                                                                          AUM, which serves as input for the selected algorithm.
the perfect decomposition into several clusters. In this case, the
                                                                        • The algorithm identifies clusters and passes that informa-
algorithm should produce a decomposition which minimally
                                                                          tion to the partitioner to create corresponding temporary
harms query processing and results in an overall performance
                                                                          tables.
improvement. Matrix clustering algorithms employ different
                                                                        • The query rewriter also receives this information. It
strategies to select such a decomposition.
                                                                          replaces the name of the original table with the ones that
   A systematic review of matrix clustering algorithms is
                                                                          were generated by the partitioner.
presented in studies [14], [15]. Here, we will consider only
                                                                        • The partitioner generates new names and sends partition-
the recent approaches.
                                                                          ing commands to the database. The exact commands are
                                                                          SELECT INTO and ALTER TABLE. The latter lets it
B. Recent Advances
                                                                          transfer primary keys.
   Within our project, we study a series of works by Chun-              • The executor accepts queries and sends them to Post-
Hung Cheng et al [17]–[20]. These algorithms employ a                     greSQL to measure the time of execution.
branch and bound search that tries to find submatrices that
conform to specific conditions. Their input is the threshold         B. Experimental Setup and Evaluation Procedure
(target cohesion), which is defined as the share of 1’s in the          In our experiments, we have used the following hardware
resulting matrices.                                                  and software setup:
   In this study we are interested in two algorithms — A09 [19]
                                                                        • Inspiron 15 7000 Gaming (0798), 8GiB, Intel(R)
and A11 [20].
                                                                           Core(TM) i5-7300HQ CPU @ 2.50GHz, TOSHIBA 1TB
   The A09 algorithm comes with three different strategies
                                                                           MQ02ABD1
that define the treatment of intersubmatrix attributes (the ones
                                                                        • Ubuntu 18.10, PostgreSQL 11.1, gcc 8.2.0
that were marked as obstacles to decomposition) — nearest,
separate, and replicate. In the first one such attribute goes to        Data for quality-related graphs was obtained by running
the nearest submatrix, in the second all such attributes are         10 invocations of the respective algorithm and averaging the
assigned to a dedicated submatrix, and the last one replicates       result. We deemed a single run sufficient for run time graphs,
the attribute into each submatrix that requires it. Note that the    since even one invocation can require up to two hours.
strategy is applied after the clustering is done.                       In order to ensure maximum quality of experiments, several
   The A11 algorithm has a different idea. If during branch          measures were taken:
and bound traversal the algorithm encounters such an attribute,         1) We eliminated data caching for both operating sys-
then it replicates it and tries to decompose the matrices further.          tem caches and PostgreSQL caches. For this, we
                                     vpart

                        workload        parser                                executor
                                                          rewriter                                       PostgreSQL
                                      algorithm                              partitioner

                                             Fig. 3: The architecture of our approach


     have restarted PostgreSQL and dropped the operat-               algorithm, and run times determine its suitability for on-line
     ing system caches before running each query. Oper-              vertical partitioning.
     ating system caches were dropped by writing “3” to                 To evaluate the quality of partitioning, we have compared
     /proc/sys/vm/drop_caches.                                       algorithm A11 to the best of other matrix clustering algorithms
  2) Next, we manually checked plans for each query and              (according to our previous studies [14], [15]) — A09. This
     noticed that some queries may have different scan               algorithm has three different strategies that were described
     operator implementations depending on the table. Fre-           earlier. In our experiments we compare the quality of resulting
     quently, a query on a partitioned table did not have            partitions of all three of them with the ones obtained by A11.
     sequential scan, but rather parallel. To handle this,              To conduct experiments we have employed the “Star” table
     we have restricted the query optimizer to use only              of the SDSS (Sloan Digital Sky Survey) dataset. The SDSS is a
     sequential scans by issuing the following command set           publicly available astronomical database that contains detailed
     max_parallel_workers_per_gather to 0;.                          three-dimensional maps of the Universe. It is frequently used
   To ensure that no hidden caching or other unaccounted             as a testing dataset in various data partitioning studies. We
processes happen, we have designed the following simple              have used the following pack: SDSS-IV Data Release 14,
criterion. Suppose that we have a set of queries that involve        2016. Its “Star” table contains 509 attributes and 492515
only a single table and are essentially scans without complex        records.
data processing. Initially, we run these queries on the original        To obtain representative workloads, we have also used the
table and record their run times. Then, for every query we           SDSS dataset. In SDSS, it is possible to see what queries users
designate a table that will contain all attributes necessary to      have issued via a special website1 . Using this website, we have
evaluate it. Thus, no joins are needed. At the same time, for        selected 8 queries from the workload that address solely this
some queries, the tables assigned to them will also contain          table.
extra attributes. Therefore, some tables may serve more than            In our first experiment we have varied the cohesion measure
one query. Then we run each query on corresponding table             (a ratio of 1 in the resulting matrices) for three strategies of
and record its run time. Eventually the following two values         A09 and compared it with A11. The results are presented in
should be approximately equal:                                       Figure 5a. On this chart, each bar represents the performance
       P
   1) Pqi ∈Queries (size(T )/time(qi ))                              of an individual algorithm with the corresponding strategy.
                                                                     There also are two horizontal lines: not clustered and pinched
   2)    qi ∈Queries (size(table(qi ))/time(qi ))
                                                                     not clustered. The first one is the workload run time on
   In these equations size(T ) is the size of a table in bytes.
                                                                     the original, unmodified table. The second is the workload
Functions time(qi ) and table(qi ) return the time it took to run
                                                                     run time on the cleaned up original table, containing only
a query qi and a table that corresponds to query qi .
                                                                     30 attributes that are referenced in the workload. In this
   In other words, the idea is to check that workload run times      experiment we varied the cohesion measure parameter.
depend solely on the size of the table.
                                                                        To evaluate algorithm run times we used both SDSS and
   Having applied all the aforementioned measures, we have           synthetic (generated) tests. The results of the SDSS tests are
obtained the difference of about 10 − 15% in these values. We        presented in Figure 5b. Here, we also vary cohesion for the
deemed such a result acceptable and decided to start evaluating      same four algorithms.
the algorithms.
                                                                        In the synthetic tests, we have tried to study the scalability
   Finally, we must note that our matrix clustering algorithms
                                                                     of the A11 algorithm in terms of run times. For this, we
are parallel [16]. However, in this paper we did not consider
                                                                     have generated a set of random 0-1 matrices with different
them and instead employed their sequential versions.
                                                                     probabilities of having 1 in each position (cohesion). Then,
                                                                     we have examined the dependency of the run time on the
C. Experiments                                                       size of the matrix. The specified threshold was set to 0.9
   In our study, we have addressed two applicability aspects of      in all experiments. If the threshold is more than the used
matrix clustering algorithms: quality of generated partitioning      cohesion, then a solution (the original matrix) is found almost
schemes and algorithm run times. Both of them are important
since quality is the primary characteristic of any partitioning        1 http://skyserver.sdss.org/log/en/traffic/
immediately. We also set a time limit of 2 hours, after reaching         estingly, for high cohesion values A11 produces more
which the algorithm is stopped.                                          fragments, but does not help to improve performance.
   We started with square matrices (see Figure 4a), then
                                                                                     VI. T HREATS TO VALIDITY
separately evaluated the influence of the number of columns
(Figure 4b) and the number of rows (Figure 4c) on the                  We have identified a number of issues that should be kept
algorithm run time. In the last two experiments we fixed one        in mind while discussing our results:
dimension to 20 and increased the other up until the time limit        1) First of all, the policy of database restarts after each
was reached.                                                              query may be unfair. In real-life scenarios where these
   Finally, we have looked into the storage requirements of               algorithms will hypothetically be used, database caching
these algorithms (Figure 6). Here, we show the required disk              would be present. However, such scenarios are nearly
space for each generated configuration. On top of each bar,               impossible to simulate since they require hundreds or
an overall number of fragments is shown. We have also                     thousands of real queries and more important, their
divided each bar into parts representing the sizes of resulting           frequencies and arrival patterns.
fragments. The sizes of original and pinched tables are shown          2) Next, the SDSS dataset is only a single dataset, so
by horizontal lines.                                                      the results may differ on other datasets. Moreover, it
                                                                          is a scientific dataset used by the astronomy research
               V. R ESULTS AND D ISCUSSION                                community and therefore, its queries and data may not
                                                                          be comparable to the industrial ones. Nevertheless, it
  • All of the algorithms produced partitioning schemes that              is popular in the vertical partitioning community (e.g.
    provide better performance than the original and pinched              see [33]–[36]) due to the lack of industrial schema-less
    tables, regardless of the cohesion value.                             benchmarks.
  • The quality of produced solutions heavily depends on the           3) There may be errors in our implementation of these
    cohesion value. Starting with the cohesion value of 0.8               algorithms. In order to mitigate this threat we have tested
    results of A11 start to rival the results of the best A09             our implementation on example matrices presented in
    strategies. However, up to this point, the clear winner is            the considered papers and ensured that the resulting par-
    A09 with replication.                                                 titioned matrices are the same. Furthermore, to address
  • Overall, the best result was produced by a replicating                this issue we plan to release the source code on GitHub.
    variant of A09 (3.358, cohesion=0.55), with a separate             4) Contemporary DBMSes are very complex systems in
    variant of A09 being the fourth (3.500, cohesion=0.8),                which minimal changes to inputs may drastically affect
    and A11 being the fifth (3.553, cohesion=0.8).                        performance. Therefore, during experimental evaluation
  • It is interesting to note that there is some sort of a global         performance may change not due to vertical partitioning,
    minimum at the 0.7 point. Here, the total time over all               but due to other events, such as query optimizer selecting
    algorithms is minimal in the whole cohesion range.                    a completely different plan. To counter this we have
  • With the SDSS workload algorithm A11 works almost                     carefully checked query execution plans to find and
    ten times faster than A09, regardless of the employed                 eliminate any inconsistencies. We have also devised a
    strategy. Note that increasing the target threshold also              criterion that allows to detect such inconsistencies in
    increases run times. For A09, run times increased from                simple cases.
    less than 1 second to almost 140 seconds, while A11 took           5) We have considered a relatively simple workload which
    0.06 and 0.119 seconds respectively.                                  involves only a single table. Having to perform extra
  • The scalability of A11 is not as good as desired. However,            joins in addition to the partitioning-induced ones may
    two points should be taken into account. Firstly, run                 significantly decrease overall performance and thus, the
    times depend on the number of referenced attributes                   desirability of vertical partitioning. However, joins with
    in the workload, not on the total number. Secondly, in                other tables are extremely rarely considered in litera-
    our scalability experiments we used an extremely large                ture [3]: only a handful of studies address them.
    threshold of the cohesion — 0.9. Finally, the author [20]
    noted that it is possible to interrupt the algorithm earlier                    VII. ACKNOWLEDGEMENTS
    while still obtaining decent results. Therefore, further          We would like to thank Anna Smirnova for contributing to
    studies are needed.                                             the editing and proofreading process.
  • Increasing the number of attributes impacts run times
    more than increasing the number of queries. In two hours                             VIII. C ONCLUSION
    time it is possible to process either a 20 × 25 matrix or          In this paper we have presented a quantitative study of
    a 205 × 20 one.                                                 two recent matrix clustering algorithms. We have studied their
  • The solutions produced by all algorithms require from           output quality, run times, and storage requirements using both
    1.5 to 2 times more disk space than the pinched table.          synthetic and real datasets.
    Increasing the target threshold increases the number of            Our evaluation has shown that for schema-less data all
    fragments and the overall required disk space. Inter-           algorithms can produce a beneficial configuration, while a
(a) A11 run times on a square matrix                        (b) Dependency of A11 run times on matrix width


                            (c) Dependency of A11 run times on matrix height
          Fig. 4: Run times of the A11 matrix clustering algorithm, synthetic datasets.


     (a) Quality of partitioning                                         (b) Algorithm run times
     Fig. 5: Performance of the A11 and A09 matrix clustering algorithms, SDSS datasets.
                                                              Fig. 6: Storage requirements.


replicating variant of A09 is 10% better than A11. However,                       [14] V. Galaktionov, G. Chernishev, B. Novikov, and D. Grigoriev, “Matrix
A11 is significantly faster and more importantly, less impacted                        clustering algorithms for vertical partitioning problem: an initial
                                                                                       performance study,” in Selected Papers of the XVIII International
by the target threshold parameter.                                                     Conference on Data Analytics and Management in Data Intensive
                                                                                       Domains (DAMDID/RCDL 2016), Ershovo, Moscow Region, Russia,
                                                                                       October 11-14, 2016., 2016, pp. 24–31. [Online]. Available: http://ceur-
                              R EFERENCES                                              ws.org/Vol-1752/paper05.pdf
                                                                                  [15] V. Galaktionov, G. Chernishev, K. Smirnov, B. Novikov, and D. A.
                                                                                       Grigoriev, “A study of several matrix-clustering vertical partitioning
 [1] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems,         algorithms in a disk-based environment,” in Data Analytics and Man-
     3rd ed. Springer Publishing Company, Incorporated, 2011.                          agement in Data Intensive Domains, L. Kalinichenko, S. O. Kuznetsov,
 [2] R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd ed.               and Y. Manolopoulos, Eds. Cham: Springer International Publishing,
     New York, NY, USA: McGraw-Hill, Inc., 2003.                                       2017, pp. 163–177.
 [3] G. Chernishev, “A survey of dbms physical design approaches,”                [16] V. Galaktionov, “Parallelization of matrix clustering algorithms
     SPIIRAS Proceedings, vol. 24, pp. 222–276, 2013. [Online]. Available:             (accepted),” in Proceedings of the Sixth International Conference
     www.mathnet.ru/trspy580                                                           on Informatics Problems (SPISOK 2016), 2016. [Online]. Available:
 [4] ——, “Vertical Partitioning in Relational DBMS,” 30 4 2015,                        http://spisok.math.spbu.ru/2016/txt/SPISOK-2016.pdf
     talk at the Moscow ACM SIGMOD chapter meeting; slides and
                                                                                  [17] C.-H. Cheng, “A branch and bound clustering algorithm,” Systems, Man
     video: http://synthesis.ipi.ac.ru/sigmod/seminar/s20150430.html [Ac-
                                                                                       and Cybernetics, IEEE Transactions on, vol. 25, no. 5, pp. 895–898,
     cessed: 2015 11 09].
                                                                                       1995.
 [5] X. Lin, M. Orlowska, and Y. Zhang, “A graph based cluster approach
     for vertical partitioning in database design,” Data & Knowledge Engi-        [18] C. Cheng, “Algorithms for vertical partitioning in database physical
     neering, vol. 11, no. 2, pp. 151–169, 1993.                                       design,” Omega, vol. 22, no. 3, pp. 291–303, 1994.
 [6] D. Sacca and G. Wiederhold, “Database partitioning in a cluster of           [19] C.-H. Cheng and J. Motwani, “An examination of cluster
     processors,” ACM Trans. Database Syst., vol. 10, pp. 29–56, 1985.                 identification-based algorithms for vertical partitions,” Int. J. Bus.
                                                                                       Inf. Syst., vol. 4, no. 6, pp. 622–638, 2009. [Online]. Available:
 [7] P. M. G. Apers, “Data allocation in distributed database systems,” ACM
                                                                                       http://dx.doi.org/10.1504/IJBIS.2009.026695
     Trans. Database Syst., vol. 13, pp. 263–304, 1988. [Online]. Available:
     http://doi.acm.org/10.1145/44498.45063                                       [20] C.-H. Cheng, K.-F. Wong, and K.-H. Woo, “An improved branch-
 [8] M. Hammer and B. Niamir, “A heuristic approach to attribute                       and-bound clustering approach for data partitioning,” International
     partitioning,” in Proceedings of the 1979 ACM SIGMOD international                Transactions in Operational Research, vol. 18, no. 2, pp. 231–255, 2011.
     conference on Management of data, ser. SIGMOD ’79. New                            [Online]. Available: http://dx.doi.org/10.1111/j.1475-3995.2010.00781.x
     York, NY, USA: ACM, 1979, pp. 93–101. [Online]. Available:                   [21] G. Chernishev, “A survey of DBMS physical design approaches
     http://doi.acm.org/10.1145/582095.582110                                          (in Russian),” Tr. St. Petersburg Inst. Infor. Avtom. Ross. Akad.
 [9] A. Jindal and J. Dittrich, “Relax and let the database do the partitioning        Nauk SPIIRAN, vol. 24, pp. 222–276, 2013. [Online]. Available:
     online,” ser. Lecture Notes in Business Information Processing, 2012,             www.mathnet.ru/trspy580
     vol. 126, pp. 65–80.                                                         [22] N. Gorla and W. J. Boe, “Database operating efficiency
[10] L. Li and L. Gruenwald, “Self-managing online partitioner for databases           in fragmented databases in mainframe, mini, and mi-
     (smopd): A vertical database partitioning system with a fully automatic           cro system environments,” Data & Knowledge Engineering,
     online approach,” ser. IDEAS ’13, 2013, pp. 168–173.                              vol. 5, no. 1, pp. 1–19, 1990. [Online]. Available:
[11] L. Rodrı́guez and X. Li, “A dynamic vertical partitioning approach for            http://www.sciencedirect.com/science/article/pii/0169023X9090030H
     distributed database system,” in Systems, Man, and Cybernetics (SMC),        [23] J. A. Hoffer and D. G. Severance, “The use of cluster analysis in
     2011 IEEE International Conference on, 2011, pp. 1853–1858.                       physical data base design,” ser. VLDB ’75, 1975, pp. 69–86. [Online].
[12] L. Rodrı́guez, X. Li, and P. Mejı́a-Alvarez, “An active system for                Available: http://doi.acm.org/10.1145/1282480.1282486
     dynamic vertical partitioning of relational databases,” ser. Lecture Notes   [24] N. Bobrov, G. Chernishev, and B. Novikov, “Workload-independent
     in Computer Science. Springer Berlin Heidelberg, 2011, vol. 7095, pp.             data-driven verticalpartitioning,” in New Trends in Databases and In-
     273–284.                                                                          formation Systems, M. Kirikova, K. Nørvåg, G. A. Papadopoulos,
[13] J. LeFevre, J. Sankaranarayanan, H. Hacigumus, J. Tatemura,                       J. Gamper, R. Wrembel, J. Darmont, and S. Rizzi, Eds. Cham: Springer
     N. Polyzotis, and M. J. Carey, “Miso: Souping up big data query                   International Publishing, 2017, pp. 275–284.
     processing with a multistore system,” ser. SIGMOD ’14, 2014, pp. 1591–       [25] C.-H. Cheng, W.-K. Lee, and K.-F. Wong, “A genetic algorithm-
     1602. [Online]. Available: http://doi.acm.org/10.1145/2588555.2588568             based clustering approach for database partitioning,” Systems, Man, and
     Cybernetics, Part C: Applications and Reviews, IEEE Transactions on,              in large databases,” ser. VLDB ’94, 1994, pp. 487–499.
     vol. 32, no. 3, pp. 215–230, 2002.                                         [33] S. Papadomanolakis and A. Ailamaki, “AutoPart: automating schema
[26] S. B. Navathe and M. Ra, “Vertical partitioning for database design: a            design for large scientific databases using data partitioning,” in Scientific
     graphical algorithm,” ser. SIGMOD ’89, 1989, pp. 440–450.                         and Statistical Database Management, 2004. Proceedings. 16th Inter-
[27] J. Du, K. Barker, and R. Alhajj, “Attraction — a global affinity measure          national Conference on, june 2004, pp. 383–392.
     for database vertical partitioning,” in ICWI. IADIS, 2003, pp. 538–548.    [34] M. G. Ivanova, M. L. Kersten, N. J. Nes, and R. A.
[28] J. H. Son and M.-H. Kim, “α-partitioning algorithm: Vertical partition-           Gonçalves, “An architecture for recycling intermediates in a
     ing based on the fuzzy graph,” ser. DEXA ’01, 2001, pp. 537–546.                  column-store,” in Proceedings of the 35th SIGMOD international
     [Online]. Available: http://dl.acm.org/citation.cfm?id=648314.755837              conference on Management of data, ser. SIGMOD ’09. New
[29] M. Bouakkaz, Y. Ouinten, and B. Ziani, “Vertical fragmentation of data            York, NY, USA: ACM, 2009, pp. 309–320. [Online]. Available:
     warehouses using the FP-Max algorithm,” in Innovations in Information             http://doi.acm.org/10.1145/1559845.1559879
     Technology (IIT), 2012 International Conference on, march 2012, pp.        [35] T. Malik, X. Wang, D. Dash, A. Chaudhary, A. Ailamaki, and R. Burns,
     273–276.                                                                          “Adaptive physical design for curated archives,” in Proceedings of the
[30] N. Gorla and B. P. W. Yan, “Vertical fragmentation in databases using             21st International Conference on Scientific and Statistical Database
     data-mining technique,” in Database Technologies: Concepts, Method-               Management, ser. SSDBM 2009. Berlin, Heidelberg: Springer-Verlag,
     ologies, Tools, and Applications, J. Erickson, Ed. IGI Global, 2009,              2009, pp. 148–166. [Online]. Available: http://dx.doi.org/10.1007/978-
     pp. 2543–2563.                                                                    3-642-02279-1 11
[31] L. Rodriguez and X. Li, “A support-based vertical partitioning method      [36] I. Alagiannis, S. Idreos, and A. Ailamaki, “H2o: A hands-free adaptive
     for database design,” in Electrical Engineering Computing Science and             store,” in Proceedings of the 2014 ACM SIGMOD International
     Automatic Control (CCE), 2011 8th International Conference on, oct.               Conference on Management of Data, ser. SIGMOD ’14. New
     2011, pp. 1–6.                                                                    York, NY, USA: ACM, 2014, pp. 1103–1114. [Online]. Available:
[32] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules   http://doi.acm.org/10.1145/2588555.2610502