=Paper= {{Paper |id=None |storemode=property |title=Challenges in Workload Analyses for Column and Row Stores |pdfUrl=https://ceur-ws.org/Vol-581/gvd2010_8_1.pdf |volume=Vol-581 |dblpUrl=https://dblp.org/rec/conf/gvd/Lubcke10 }} ==Challenges in Workload Analyses for Column and Row Stores== https://ceur-ws.org/Vol-581/gvd2010_8_1.pdf
Challenges in Workload Analyses for Column and Row Stores∗




                                                                       Andreas Lübcke
                                                         School of Computer Science
                                               Otto-von-Guericke-University Magdeburg, Germany
                                                             andreas.luebcke@ovgu.de


ABSTRACT

Within the DWH domain, a new architecture known as column store is emerging, especially to improve the performance of data analyses and aggregations. Column stores achieve good results in performance benchmarks like TPC-H. In recent years, however, row stores have dominated the data warehousing domain. We therefore examine different application fields for row and column stores. To support our assumption, we perform a case study based on the TPC-H benchmark. To the best of our knowledge, there is no advisor for the selection of a storage architecture for a given application. We present an idea to overcome the workload analysis problems across different architectures in the DWH domain.

∗We thank Martin Kuhlemann and Ingolf Geist for helpful discussions and comments on earlier drafts of this paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GvD Workshop’10, 25.-28.05.2010, Bad Helmstedt, Germany.
Copyright 2010 ACM. Copyright is held by the author/owner(s). ...$10.00.

1. INTRODUCTION

Database management systems (DBMSs) are pervasive in current applications. With database tuning, practitioners aim at optimal performance, especially for large-scale systems like a data warehouse (DWH), but the administration and optimization of DBMSs is costly. In recent years, row-oriented DBMSs (row stores) have dominated the DWH domain, but a new approach known as column-oriented DBMSs (column stores) is arising [2]. Column stores are intended to increase the performance of analyses in the DWH domain [1, 9, 14, 18]. Lately, column stores have outperformed row stores in this domain and achieve good results in performance benchmarks like TPC-H (refer to http://www.tpc.org/tpch/results/tpch_perf_results.asp for the latest results). Consequently, the complexity of system design increases with the newly available DBMSs because the design process now also implies the selection of a suitable system architecture. To the best of our knowledge, there is no framework available that advises the optimal architecture for a given application or workload. Heuristics and studies show that column stores perform very well on aggregations [3, 14]. In contrast, column stores perform worse than row stores on update and tuple operations, i.e., on processing entire tuples. Operation types and their frequency within a workload differ from application to application, thus we have to analyze the workloads to select the most suitable architecture. In other words, we have to contrast the performance of row stores with the performance of column stores for different applications.

In this paper, we analyze the differences between column and row stores to show the increased challenge of system design within the DWH domain. We illustrate new challenges for workload analyses and introduce a solution. Afterwards, our study illustrates that there are application fields for both architectures within the DWH domain.

2. COLUMN VS. ROW STORES

This section gives an overview of the differences in storage method and functionality between column and row stores. Based on these differences, we show the increased complexity of system design and the difficulty of performance estimation across row and column stores.

First, the storage architectures differ in the way they partition a (relational) table. Row stores partition a table horizontally, i.e., all attribute values of one tuple are stored sequentially. In contrast, column stores sequentially store the values of an attribute (column). Hence, column stores have to reconstruct the tuples during query execution if more than one column is affected. In the DWH domain, we mostly process aggregations over one column. Column stores reduce the overhead for aggregations, e.g., I/O costs, because the affected column is accessed directly. However, the vertical partitioning in column stores also implies worse performance on update processing: caused by the column-oriented architecture, the tuples have to be reconstructed and subsequently partitioned again during update processing. In contrast, row stores perform better on tuple operations and updates, which are also important operations within workloads even in the DWH domain. Several approaches [1, 9, 12, 16] attempt to solve the update problems of column stores, but these approaches do not reach the performance of row stores. Abadi et al. show that vertical partitioning of relations in row stores is not a suitable compromise either [3].

Second, column and row stores also differ in their functionality. Row stores utilize indexes and materialized views to improve performance. Therefore, DBMS vendors and researchers have developed a number of self-tuning techniques [5] to automatically tune the DBMSs. These techniques, e.g., indexes and index self-tuning, do not exist for column stores, i.e., column stores do not utilize mature frameworks/tools for self-tuning under shifting workloads as row stores do [5]. In contrast, column stores have better support for compression techniques than row stores [1]. Column stores offer a number of compression techniques that can be chosen for each column according to its data type. Moreover, some column stores can directly process compressed data [1]. Row stores have to use one compression scheme for a tuple or tuple partition [3], i.e., the selected compression technique is a compromise to satisfy different data types. Furthermore, column and row stores use different query processing techniques, caused by the different storage architectures. On the one hand, column stores have to reconstruct tuples during query processing, whereby the point in time of the reconstruction crucially influences the query processing [4]. On the other hand, row stores can directly process several columns (no tuple reconstruction), but they have to access entire tuples even if just one column has to be processed. Row store query processors are always tuple-oriented, no matter how the data is partitioned. Column stores can utilize a row-oriented query processor as well as a column-oriented query processor [1], i.e., the performance of a column store is already affected by the selection of the query processor.

In conclusion, we state that the complexity of database design (and also tuning) has been increased by the appearance of column stores within the DWH domain. In recent years, we estimated the performance of row stores for a given workload and tuned these systems with respect to that workload. We are able to compare several systems (row stores) because their core functionality differs only slightly. Today, we have to estimate the performance of DBMSs across two architectures, i.e., we have to choose the most suitable architecture for a given workload. One can argue that column stores are more suitable for DWH applications in general because column stores perform better on essential operations for the DWH domain, e.g., aggregations. We argue that there are application fields for both column and row stores because neither of the two architectures is suitable for every workload. Current approaches [1, 9, 11, 12, 16] confirm our position, e.g., updates in real-time DWHs [17].

3. NEW CHALLENGES FOR WORKLOAD ANALYSES

New applications demand performant aggregations in column stores as well as performant tuple operations in row stores, or at least row store functionality. Hence, we need new approaches for workload analyses because we have to compare different DBMSs (row vs. column store) that differ significantly in their storage, functionality, and query processing techniques.

In recent years, we compared different tuning and optimization techniques of DBMSs to figure out the most suitable DBMS for a given workload. Row stores have only minor differences in functionality, and they often differ only in their implementation. Therefore, workload analyses based on entire queries are suitable because the queries are processed in the same way and the comparison of query execution times is sufficient.

To compare (systems with) different architectures, we have to estimate the performance of a given workload for both architectures. Due to the different performance of both architectures on certain operations, it seems obvious to analyze the operations themselves. Consequently, current workload analysis approaches and estimation tools have to be adapted because these approaches process entire queries and their structure and do not offer the opportunity to analyze the operations of a query. We argue that this step is necessary because certain operations, e.g., tuple reconstructions in column stores, cannot be appraised from a given query and its structure. Furthermore, a certain operation can directly influence the overall performance of a query, e.g., tuple operations. Due to heuristics, column stores will be chosen for workloads that contain a certain amount of aggregations. This assumption is often applicable, but we cannot generalize it. If the workload contains a huge number of tuple operations besides the aggregations, then column stores perform poorly because they have to reconstruct a lot of tuples. Tuple reconstructions are not necessary for row stores, thus we have to analyze the operations of a query to estimate the overall performance of a query correctly. With the help of workload analyses based on operations, we can obtain weighted, comparable estimations for different operations even if certain operations only exist for one of the two architectures, e.g., tuple reconstruction for column stores.

4. CASE STUDY: TPC-H ON DIFFERENT ARCHITECTURES

In this section, we present our case study based on the TPC-H benchmark on a column store and a row store. We describe our study environment and introduce our assumptions for this study. Afterwards, we discuss the results of the benchmark runs.

4.1 Environment & Assumptions

Our test environment is an Ubuntu 9.10 64bit system running on a Samsung X65 with a 2.2GHz dual core processor, 2GB RAM, and a 2GB swap partition. We decide to use Infobright ICE (http://www.infobright.org) 3.2.2 and MySQL (http://www.mysql.org) 5.1.37 for our study. Thereby, ICE represents the column stores and MySQL represents the row stores. Our decision is based on two main reasons. First, both DBMSs are freely available, and second, both DBMSs are relatively similar. Both systems use the common MySQL kernel/management services except that they utilize different storage architectures. Of course, Infobright adds functionality to the underlying MySQL, e.g., another storage manager, but there are no other DBMSs that utilize different storage architectures and are as similar as these two. To the best of our knowledge, there are no DBMSs more suitable to compare impacts on column and row stores, even if ICE is already a DBMS for DWH applications and MySQL is implemented for transactional processing (OLTP). We adjust the configurations of both DBMSs to the MySQL standard configuration to guarantee the comparability of the results. For both systems, no additional indexes or views are created except indexes and views that are caused by the DDL (primary key) or by the workload itself.

To exclude impacts from a poorly chosen workload, we use the standardized TPC-H (2.8.0) benchmark with scale factor 1 (1GB). This benchmark is representative for the DWH domain.
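Our comparison rests on per-query wall-clock execution times (cf. Table 1). As an illustration of such a measurement harness — a sketch only, using Python's built-in sqlite3 as a stand-in engine, since neither ICE nor MySQL is assumed available here, and with an invented miniature lineitem table instead of the real TPC-H data:

```python
import sqlite3
import time

def time_query(conn, sql):
    """Return the wall-clock execution time of `sql` in seconds.
    The cursor is fully drained so that a lazy engine cannot
    defer the actual work past the timer."""
    start = time.perf_counter()
    cur = conn.execute(sql)
    cur.fetchall()  # force complete result materialization
    return time.perf_counter() - start

# Illustrative stand-in: a tiny in-memory table, not TPC-H scale factor 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_shipdate TEXT, l_quantity REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [(f"1998-01-{d:02d}", float(d)) for d in range(1, 29)])

elapsed = time_query(conn, "SELECT l_shipdate, SUM(l_quantity) "
                           "FROM lineitem GROUP BY l_shipdate")
print(f"query took {elapsed:.6f}s")
```

The same two-line timing pattern applies to any DB-API connection, so both systems can be measured by the identical code path, which keeps the comparison fair.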
The data (1GB) does not fit completely into RAM with the MySQL standard configuration, e.g., the 16MB key buffer size is much less than 1GB. We run two series of tests with the TPC-H benchmark to show that there are still application fields for row stores in the DWH domain, thus column stores do not outperform row stores on each query.

We run one series of tests with the standard TPC-H benchmark to obtain reference values. The second series of tests runs with an adjusted TPC-H benchmark. We adjusted the TPC-H benchmark in the following way. We want to show that a storage architecture decision can easily be shifted by changing workloads. Therefore, we modify the TPC-H benchmark, i.e., we change the number of returned attributes such that each query returns its result without projection (in the SELECT statement). Listing 1 shows an exemplary adjusted TPC-H query. We have to add a GROUP BY statement to the queries Q6, Q14, Q17, and Q19 to create valid SQL statements because more than one attribute has to be processed now. We decide to group all four queries by the same attribute (l_shipdate), i.e., each query is extended by GROUP BY l_shipdate. We also apply these changes to the series of tests with the standard TPC-H benchmark to guarantee comparability.

Finally, we state that we exclude three queries from our test series. First, Q13 is not executable in MySQL syntax. Second, we remove Q18 from our test series because MySQL is not able to finish the query; we abort the execution after over 21 hours. In contrast, the execution time on ICE is only 8 seconds. Third, Q21 has an extreme execution time of 6 hours on ICE, which indicates optimizer problems for this query. MySQL executes Q21 in only 2 minutes and 48 seconds.

The results of the two test series are shown in Table 1 and are discussed in the following section.

                Standard TPC-H         Adjusted TPC-H
   Query #      ICE        MySQL       ICE        MySQL
   TPC-H Q1     00:00:25   00:00:26    00:01:18   00:00:28
   TPC-H Q2     00:00:45   00:01:31    00:01:09   00:01:34
   TPC-H Q3     00:00:03   00:00:28    00:01:11   00:00:27
   TPC-H Q4     00:02:32   00:00:05    00:02:42   00:00:05
   TPC-H Q5     00:00:03   00:01:25    00:01:06   00:01:31
   TPC-H Q6     00:00:00   00:00:03    00:00:40   00:00:04
   TPC-H Q7     00:00:03   00:00:30    00:00:04   00:00:30
   TPC-H Q8     00:00:02   00:00:05    00:00:02   00:00:05
   TPC-H Q9     00:00:05   00:00:50    00:01:09   00:00:48
   TPC-H Q10    00:00:08   00:00:10    00:02:06   00:00:12
   TPC-H Q11    00:00:01   00:00:00    00:00:22   00:00:01
   TPC-H Q12    00:00:02   00:00:04    00:01:00   00:00:04
   TPC-H Q14    00:00:01   00:00:32    00:00:43   00:00:31
   TPC-H Q15    00:00:01   00:00:08    00:00:02   00:00:08
   TPC-H Q16    00:00:01   00:00:09    00:00:24   00:00:12
   TPC-H Q17    00:24:15   00:00:01    00:24:41   00:00:01
   TPC-H Q19    00:00:03   00:00:00    00:00:31   00:00:00
   TPC-H Q20    00:10:48   00:00:01    00:10:51   00:00:00
   TPC-H Q22    00:19:21   00:00:01    00:19:23   00:00:01

Table 1: Comparison of Query Execution Times for ICE and MySQL on TPC-H and adjusted TPC-H

4.2 Discussion

Our study shows different impacts of our adjustments on the query execution times. We determine three different impacts within our study.

First, we cannot figure out an impact of our adjustments on the query execution of MySQL. There are only some queries that show a negligible impact, e.g., the query execution times of Q2 and Q5 are increased by only 3 seconds with respect to an overall query execution time above 1 minute. We expect this behavior because row stores do not have drawbacks while processing entire tuples. We assume the final projection within a query plan has no impact on query execution time for row stores because unnecessary attributes are only cropped from the result sets. We assume this does not apply to projections on intermediate result sets during the optimization process.

Second, ICE shows an obvious impact on several queries, e.g., Q3 and Q19. The increased costs for these queries show the influence of the size of the processed tuples on the performance of column stores. We assume these costs are caused by the necessary tuple reconstruction during query execution. Hence, analyses on larger tuples within result sets can degrade the performance of column stores. We state that large tuple sizes within a result set cannot be disregarded for the storage architecture decision, especially not for analysis and reporting tools that process huge data sets.

Third, some queries do not show an impact according to our adjustments of the TPC-H benchmark, e.g., Q7 or Q15. We assume that the costs for tuple reconstructions for these queries do not have a major share of the total costs; the result sets and the interim results are comparatively small. Hence, there are other operations within these queries that cause the major part of the costs. Consequently, we cannot figure out general decision rules. We have to analyze the influence of single operations on the total costs of a query.

Queries Q4, Q17, Q19, and Q22 have to be considered separately. The long query execution times for ICE indicate an issue related to the query structure. We assume that the ICE optimizer or the ICE query processor has an issue while processing nested queries. However, our results also show that there is only a negligible impact on both systems by our adjustments for these queries.

Our assumption is confirmed that column stores cannot outperform row stores on each query, i.e., there are application fields for both row and column stores in the DWH domain.

5. RELATED WORK

Several open- and closed-source column stores have been released [1, 8, 13, 18], but all these systems are pure column stores and do not support any row store functionality. We state that all application fields in the DWH domain cannot be satisfied by systems that support only one architecture.

Regarding solutions for architectural problems, there are several approaches [1, 12] available which try to reduce the drawbacks caused by the architecture, i.e., these approaches replicate data to overcome the drawbacks. We want to figure out the most suitable architecture for a given application/workload without replication mechanisms.

We need to analyze all workloads to figure out the application fields for row and column stores. Therefore, we can utilize, adapt, and extend existing approaches such as Turbyfill [15], who considers mixed workloads, or Raatikainen [10], who considers workload clustering analysis. In contrast, our approach needs to classify and analyze database operations themselves. The approaches of Favre et al. [6] and Holze et al. [7] step in the direction of self-tuning databases, i.e., they analyze changing workloads.
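A minimal sketch of the operation-based classification we have in mind might look as follows. The operation categories, the weights they would later feed, and the matching on SQL text are simplifying assumptions made for illustration only; a real analyzer would inspect query plans rather than query strings:

```python
import re
from collections import Counter

# Illustrative operation categories relevant to the architecture decision.
PATTERNS = {
    "aggregation":     re.compile(r"\b(SUM|COUNT|AVG|MIN|MAX)\s*\(", re.I),
    "tuple_operation": re.compile(r"SELECT\s+\*", re.I),  # full tuples needed
    "update":          re.compile(r"\b(UPDATE|INSERT|DELETE)\b", re.I),
}

def classify(query):
    """Count the operations of a single query per category."""
    return Counter({op: len(p.findall(query)) for op, p in PATTERNS.items()})

def analyze(workload):
    """Aggregate operation frequencies over a whole workload."""
    total = Counter()
    for query in workload:
        total += classify(query)
    return total

workload = [
    "SELECT l_shipdate, SUM(l_quantity) FROM lineitem GROUP BY l_shipdate",
    "SELECT * FROM lineitem WHERE l_shipdate = '1998-09-01'",
    "UPDATE lineitem SET l_quantity = 0 WHERE l_shipdate < '1992-01-01'",
]
counts = analyze(workload)
print(counts)  # operation frequencies instead of whole-query figures
```

Such per-operation frequencies, rather than whole-query execution times, are the input from which weighted, architecture-specific estimations could then be derived.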
SELECT *, COUNT(DISTINCT ps_suppkey) AS supplier_cnt
FROM partsupp, part
WHERE p_partkey = ps_partkey AND p_brand <> 'Brand#51'
  AND p_type NOT LIKE 'SMALL PLATED%'
  AND p_size IN (3, 12, 14, 45, 42, 21, 13, 37)
  AND ps_suppkey NOT IN (
    SELECT s_suppkey FROM supplier WHERE s_comment LIKE '%Customer%Complaints%')
GROUP BY p_brand, p_type, p_size
ORDER BY supplier_cnt DESC, p_brand, p_type, p_size;

                                        Listing 1: Adjusted TPC-H Query Q16
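Why removing the projection (SELECT * as in Listing 1) shifts cost onto the column store can be illustrated with a back-of-the-envelope I/O model. The column widths and row count below are invented for illustration, and the model deliberately ignores the CPU cost of tuple reconstruction, which adds further overhead for the column store:

```python
# Hypothetical byte widths of a table's columns; invented numbers.
COLUMN_WIDTHS = {"p_partkey": 8, "p_brand": 10, "p_type": 25,
                 "p_size": 4, "p_comment": 23}
N_ROWS = 200_000
ROW_WIDTH = sum(COLUMN_WIDTHS.values())  # 70 bytes per tuple

def bytes_read_column_store(referenced):
    """A column store only touches the referenced columns."""
    return N_ROWS * sum(COLUMN_WIDTHS[c] for c in referenced)

def bytes_read_row_store(referenced):
    """A row store reads entire tuples regardless of the projection."""
    return N_ROWS * ROW_WIDTH

projected = ["p_brand", "p_size"]   # narrow, TPC-H-style projection
all_columns = list(COLUMN_WIDTHS)   # adjusted query: SELECT *

# With a narrow projection the column store reads far less data ...
assert bytes_read_column_store(projected) < bytes_read_row_store(projected)
# ... but SELECT * erases that advantage: every attribute is needed.
assert bytes_read_column_store(all_columns) == bytes_read_row_store(all_columns)
```

Under these assumptions the narrow projection reads 14 of 70 bytes per tuple in the column store, a five-fold I/O advantage that SELECT * removes entirely.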


We could adopt these approaches to develop an alerter that advises the redesign of a system at the architectural level.

Current research reflects new approaches to solve the update problems of OLAP applications, e.g., dimension updates [16]. Moreover, the update problem is aggravated by new demands like real-time DWHs [11, 17]. Thus, solutions are needed to overcome update processing problems.

6. CONCLUSION

We illustrated the major differences in storage method and functionality between column and row stores. Therefrom, we showed the difficulty of comparing the performance of the two architectures, or of systems with different architectures. Consequently, we discussed the increased complexity of the physical design process for DWH applications, e.g., the distinction between two architectures. We mentioned several approaches that show the necessity of column stores as well as row stores, or at least row store functionality, within the DWH domain. Afterwards, we analyzed the impact of certain operations on both architectures and figured out that current workload analysis approaches are not sufficient: they cannot analyze workloads for both architectures adequately, thus we proposed that workload analysis should be based on operations instead of entire queries. We argued that this approach can satisfy the necessity to evaluate the performance of systems across different architectures.

Our study with ICE and MySQL shows that column stores cannot outperform row stores for every workload, even though ICE is implemented for DWH applications, and vice versa for MySQL. Hence, we confirm our assumption that there are application fields for both column and row stores within the DWH domain. Thus, a framework for the selection of the optimal architecture is necessary. Our study also shows the necessity to adapt current approaches for more significant performance estimations. Finally, we discussed the impact of certain operations on the overall performance of a query.

7. REFERENCES
 [1] D. J. Abadi. Query execution in column-oriented database systems. PhD thesis, Cambridge, MA, USA, 2008. Adviser: Samuel Madden.
 [2] D. J. Abadi, P. A. Boncz, and S. Harizopoulos. Column oriented database systems. PVLDB, 2(2):1664–1665, 2009.
 [3] D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. row-stores: How different are they really? In SIGMOD ’08, pages 967–980, New York, NY, USA, 2008. ACM.
 [4] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. Madden. Materialization strategies in a column-oriented DBMS. In ICDE, pages 466–475, 2007.
 [5] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In VLDB ’07, pages 3–14. VLDB Endowment, 2007.
 [6] C. Favre, F. Bentayeb, and O. Boussaid. Evolution of data warehouses’ optimization: A workload perspective. In DaWaK ’07, pages 13–22, 2007.
 [7] M. Holze, C. Gaidies, and N. Ritter. Consistent on-line classification of DBS workload events. In CIKM ’09, pages 1641–1644, 2009.
 [8] T. Legler, W. Lehner, and A. Ross. Data mining with the SAP NetWeaver BI Accelerator. In VLDB ’06, pages 1059–1068. VLDB Endowment, 2006.
 [9] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD ’09, pages 1–2, New York, NY, USA, 2009. ACM.
[10] K. E. E. Raatikainen. Cluster analysis and workload classification. SIGMETRICS Performance Evaluation Review, 20(4):24–30, 1993.
[11] R. J. Santos and J. Bernardino. Real-time data warehouse loading methodology. In IDEAS ’08, pages 49–58, New York, NY, USA, 2008. ACM.
[12] J. Schaffner, A. Bog, J. Krüger, and A. Zeier. A hybrid row-column OLTP database architecture for operational reporting. In BIRTE ’08, 2008.
[13] D. Ślȩzak, J. Wróblewski, V. Eastwood, and P. Synak. Brighthouse: an analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337–1345, 2008.
[14] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A column-oriented DBMS. In VLDB ’05, pages 553–564, 2005.
[15] C. Turbyfill. Disk performance and access patterns for mixed database workloads. IEEE Data Engineering Bulletin, 11(1):48–54, 1988.
[16] A. A. Vaisman, A. O. Mendelzon, W. Ruaro, and S. G. Cymerman. Supporting dimension updates in an OLAP server. Information Systems, 29(2):165–185, 2004.
[17] Y. Zhu, L. An, and S. Liu. Data updating and query in real-time data warehouse system. In CSSE ’08, pages 1295–1297, 2008.
[18] M. Zukowski, P. A. Boncz, N. Nes, and S. Heman. MonetDB/X100 - a DBMS in the CPU cache. IEEE Data Engineering Bulletin, 28(2):17–22, June 2005.