<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comprehensive Study of Late Materialization Strategies for a Disk-Based Column-Store</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>George Chernishev</string-name>
          <email>chernishev@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viacheslav Galaktionov</string-name>
          <email>viacheslav.galaktionov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentin Grigorev</string-name>
          <email>valentin.d.grigorev@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeniy Klyuchikov</string-name>
          <email>evgeniy.klyuchikov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kirill Smirnov</string-name>
          <email>kirill.k.smirnov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Saint Petersburg State University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>By allowing operations on positions (row IDs, ofsets), columnstores increase the overall number of admissible query plans. Their plans can be classified into a number of so-called materialization strategies, which describe the moment when positions are switched to tuples. Despite being a well-studied topic with several diferent implementations, there is still no formal definition for it, as well as no classification of existing approaches. In this paper we review and classify these approaches. Our classification shows that, for disk-based systems, none of the existing implementation variants eficiently combines position manipulation inside both selections and joins. For this reason, we propose such an approach which we name “ultra-late materialization”. Further, we describe recent modifications of PosDB - a distributed, disk-based column-store. These modifications allowed us to implement a flexible query processing model. Relying on it, we have implemented a number of late materialization variants, including our approach. Finally, we empirically evaluate the performance of ultra-late materialization and classic strategies. We also compare it with two industrial-grade disk-based systems: PostgreSQL and MariaDB Column Store. Experiments demonstrate that our variant of late materialization outperforms the closest competitor (MariaDB Column Store) by 50% which makes further investigation worthwhile.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Unlike row-stores, column-stores can operate not only on data,
but on positions as well (or row IDs, ofsets, etc.) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Usually,
positions are employed on the lower levels of query plan, e.g., if
there is a very selective predicate on some column, then it may
be worthwhile to evaluate it first, and then perform positional
lookup for values of other attributes.
      </p>
      <p>Operating on positions opens a number of diferent query
processing strategies centered around so-called “materialization”.
Roughly speaking, materialization is the process of switching
positions back into values. Materialization eventually has to
happen at some point during query execution, since the user needs
tuples instead of positions.</p>
      <p>
        The literature presents two main strategies: early (EM) and
late materialization (LM). The basic idea of LM is to operate on
positions and defer tuple reconstruction for as long as possible.
This strategy is capable of significantly conserving I/O bandwidth
and reducing the CPU processing load for appropriate queries. It
was extensively studied in the earliest of the new-wave
columnstores [
        <xref ref-type="bibr" rid="ref21 ref33">21, 33</xref>
        ] and was deemed useful in cases when a lot of
tuples are discarded by plan operators. However, the majority of
modern industrial column-stores do not employ this strategy and
instead rely on EM. The idea of this approach can be described
as not allowing positions past filter operators. The reasons for
this are the complexity of implementing an optimizer and issues
related to processing disk-spilling joins.
      </p>
      <p>However, there is no formal definition of late materialization;
it is more of a paradigm than a fixed set of rules. Over the years,
there were multiple proposals which concerned position-based
query processing with LM support, each involving diferent
techniques, use-cases, and aims. Despite late materialization being a
well-established topic, there is motivation to study the various
approaches:
• More than ten years have passed since the last of
materialization experiments. There are various accumulated
changes in the hardware, such as the improvement of SSDs
and appearance of novel types of storage, such as NVMe.
• Furthermore, our review demonstrates that the research
landscape concerning the processing of joins in a late
materialization environment is incomplete and needs
expansion. The majority of reviewed studies addressed queries
containing not more than a single join.
• There are novel applications of position enabled-processing
such as systems for visual analytics.</p>
      <p>Thus, there is a strong call for taking a second look into
materialization.</p>
      <p>In this paper we review existing approaches in order to
compare and classify them. For this, we examine query plans and
study position-related techniques that they employ. As the
result, we come to the conclusion that there are two types of late
materialization: late materialization in selection parts of the plan
and late materialization in joins. We demonstrate that there are
yet to be studies on how to eficiently combine these approaches
for disk-based systems.</p>
      <p>Following this, we propose the ultra-late materialization
approach which eficiently combines both types of late
materialization techniques.</p>
      <p>Next, we describe the recent modifications of PosDB which
allowed us to greatly expand its query processing model. It now
ofers substantial flexibility in terms of admissible query plans
and makes it possible to implement diferent types of
materialization strategies. We have implemented the proposed ultra-late
materialization as well as several other materialization approaches.</p>
      <p>Finally, we evaluate these strategies, comparing them with
each other and running additional experiments with two
industrial DBMSes, one of which is a column-store.</p>
      <p>Overall, the contributions of this paper are the following:
(1) A discussion of LM approaches, examination of their
supported query plans, their comparison and classification.
(2) A proposal for ultra-late materialization: a
materialization strategy that allows to operate on positions in both
selections and joins.
(3) A description of PosDB architecture and its recent
modifications which made it possible to implement a number of
diferent materialization strategies.
(4) An experimental evaluation of these strategies inside PosDB
and additionally using two industrial systems — row- and
column-store, using the Star Schema Benchmark.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND, RELATED WORK, AND</title>
    </sec>
    <sec id="sec-3">
      <title>MOTIVATION</title>
    </sec>
    <sec id="sec-4">
      <title>What are Early and Late Materialization?</title>
    </sec>
    <sec id="sec-5">
      <title>A Classification of Approaches</title>
      <p>
        Materialization is a process of combining data from individual
columns into wide records [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are several possible
materialization strategies, each of which defines its own query plans
and types of employed operators.
      </p>
      <p>
        There are two types of approaches: early materialization and
late materialization. The first one is adopted by many so-called [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
naive column-stores which only store data in columnar format
and use this feature only to read a subset of columns which are
necessary for a particular query. During query processing, they
immediately decompress data, form rows and then proceed to
evaluate query similarly to row-stores. Therefore, they do not
process data in columnar form.
      </p>
      <p>Although early materialization provides some benefits — lower
volumes of data read from disk (compared to row-stores), it is
possible to do even better. The idea of late materialization is to
defer record reconstruction to a later moment. The goal is to find
a better query plan which becomes available in the context of
this strategy. Usually, implementing late materialization requires
additional eforts such as devising novel query processing models
and operators.</p>
      <p>Unlike early materialization systems, late materialization ones
not only store, but also process columnar data, aggressively using
positions inside the query executor. The idea is to conserve disk
bandwidth using knowledge of selectivities.</p>
      <p>We can classify all studies by the place of query plan which
employs positions. Query plans have a fairly regular structure:
selections are placed on the bottom and joins usually reside above.
The capability to employ positions inside each of these parts
corresponds to a variant of late materialization. Thus, there are
two places where late materialization can be implemented: inside
selections and inside joins.</p>
      <p>Late materialization has been studied for several decades.
However, it was successfully implemented only in a handful of studies.
We will review them below.
2.2</p>
    </sec>
    <sec id="sec-6">
      <title>Early Days of Late Materialization: Works of G. Copeland et al.</title>
      <p>
        One of the earliest studies that considered a variant of late
materialization was a series of papers by G. Copeland et al. [
        <xref ref-type="bibr" rid="ref14 ref22">14, 22</xref>
        ],
published in the 80’s. The authors proposed a query execution
approach that took into account positions in the table, which they
called surrogate attributes. Data was stored in two-column tables
made up from records containing &lt;surrogate attribute, value&gt;
pairs.
      </p>
      <p>
        To represent joins, the authors employed a join index [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ].
However, unlike modern systems, they proposed to pre-compute
and store the join indexes with the rest of the data on disk.
      </p>
      <p>The central idea of the approach is to process sets consisting
of two-column tables which may be either data columns with
assigned pivot attributes or join indexes. The processing is based
on a pivot surrogate — a column which other columns are joined
to using an N-ary join.</p>
      <p>The overall process consisted of several phases:
(1) The Selection phase, during which predicates are applied
to individual tables. This process resulted in two types of
intermediates: the ones which contained only surrogate
attribute and &lt;surrogate attribute, value&gt; records.
(2) The Pivot phase, during which results from individual
tables are “stitched” together. For this, intermediates
obtained in the previous phase and stored join indexes are
joined via an N-ary join. An optimal pivot surrogate is
selected and all other itermediates are joined to it. This
phase results in a set of two-column tables consisting of
&lt;pivot surrogate, surrogate attribute&gt; records.
(3) The Materialization phase, which processes each table
individually: each attribute necessary to answer the query
is read from disk to construct a set of two-column tables
containing &lt;pivot surrogate, value of projected attribute&gt;
records.
(4) The Composition phase uses the N-ary join of
intermediates obtained in the previous phase to construct records.</p>
      <p>Thus, this approach may be considered as a variant of late
materialization which operated on positions inside selections
and joins. However, one may say that it was not pure late
materialization, but hybrid instead, since after the value materialization
phase the query executor continued to operate on both positions
(pivot surrogate) and values.</p>
      <p>The proposed processing scheme was rigid, ofered little to no
opportunities of query optimization, but more importantly, it did
not explore available options of performing selections the way
C-Store did, which is discussed further.</p>
      <p>
        These drawbacks, at least partly, come from the fact the paper
was published several years before the release of the famous
Volcano [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] paper.
2.3
      </p>
    </sec>
    <sec id="sec-7">
      <title>Late Materialization in Fractured Mirrors</title>
      <p>
        The Fractured Mirrors paper [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] also explored late
materialization. Its authors designed a system in which row- and
columnstore data layouts were combined. They used TPC-H [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] queries
to benchmark and study plans produced by their approach.
      </p>
      <p>The core of the approach is the DSMScan operator, which is a
specialized variant of multi-way merge-join. To demonstrate the
produced plans, the authors used several queries, two of which
we will consider. They are presented in Figure 1.</p>
      <p>The first one is Q1* which is essentially the TPC-H Q1 query,
in which the predicate was inverted to make it more selective.</p>
      <p>The leftmost shows the row-store mode plan. There, query
executor performs an index scan and sends qualifying tuples
into the aggregation operator. The columnar plan has a modified
index scan operator that returns record ids, which are then fed
into six columnar access operators. DSMScan orchestrates them
and creates records which are passed to the aggregation operator.
Therefore, this query plan has late materialization in the selection
part.</p>
      <p>The right part of Figure 1 contains plans for the TPC-H Q19,
which consists of selections and a single join. It shows two DSM
plans — with early and late materialization. In case of early
materialization, tuples are formed at the bottom levels of the plan and
then joined with the tuple-based join. The late materialization
plan is diferent: at the first step, only four out of six attributes
belonging to LineItem (LItem in the figure) table are accessed.
Then the tuple-based join is run and the remaining two attributes
are read from disk. Therefore, one may say that for l_extprice and
l_discount late materialization is performed. However, in order
to perform reaccess, positions are kept together with values after
the join. This way, it is more of a hybrid than late materialization
technique.</p>
      <p>Therefore, this variant includes late materialization in
selections and partially in joins. However, similarly to the approach
described in Section 2.2 it lacked C-Store-like scans and it did
not deeply explore join performance. In this study, only a single
considered query contained more than one join.
2.4</p>
    </sec>
    <sec id="sec-8">
      <title>FlashJoin: late materialization inside</title>
    </sec>
    <sec id="sec-9">
      <title>PostgreSQL</title>
      <p>
        In the previous study, joins “captured” some of attributes that
were materialized earlier. An alternative approach was proposed
by the HP Labs research group [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
      </p>
      <p>They proposed a new join operator called FlashJoin consisting
of two components: fetch kernel and join kernel. The join kernel
will process only the two necessary attributes, one for each table.
The goal of the fetch kernel is to read attributes necessary for
further query processing.</p>
      <p>An example of this processing scheme is presented in Figure 2
(adapted from the original paper).</p>
      <p>In this query, the first operator joins 1 and 2 using the
1. = 2. condition. The result of its join kernel is the list of
position pairs, one for each table. The next join requires the 
attribute, therefore the fetch kernel reads it. In a similar manner,
the next operator performs the join. However, this time the fetch
Select R1.B, R1.C,
R2.E, R2.H, R3.F
From R1, R2, R3
Where R1.A = R2.D
AND R2.G = R3.K</p>
      <p>G</p>
      <sec id="sec-9-1">
        <title>Project</title>
      </sec>
      <sec id="sec-9-2">
        <title>Join A=D</title>
      </sec>
      <sec id="sec-9-3">
        <title>Project</title>
      </sec>
      <sec id="sec-9-4">
        <title>Join G=K</title>
        <p>K</p>
        <p>F
C</p>
        <p>B</p>
        <p>A</p>
        <p>D</p>
        <p>G</p>
        <p>E</p>
        <p>H
kernel reads five attributes, belonging to all three tables. Thus,
this approach supports late materialization inside join sequences.</p>
        <p>The approach was implemented inside the PostgreSQL system
and evaluated in an SSD environment. However, their
experiments used queries featuring no more than two joins. Finally,
this study did not consider late materialization inside selections.
2.5</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Late materialization in C-Store</title>
      <p>
        In the C-Store [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] system, late materialization was implemented
diferently. It relied on a special data structure: a multi-column
block. It describes a consequent horizontal fragment of a table,
transferred between operators in a Volcano-style manner. Each
block contains [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
• a position descriptor which indicates the status of each
record — valid (i.e. satisfies the predicate) or not, and
• values of several columns, each one of which may be
compressed by its own algorithm.
      </p>
      <p>
        The multi-column block made it possible to implement late
materialization inside selections [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Consider, for example, a
query which scans the LineItem table (the SSB [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] one) and selects
records consisting of  and ℎ attributes which
satisfy ℎ &lt;  1    &lt;  2 condition.
      </p>
      <p>The paper discussed four approaches to evaluating this query,
which are presented in Figure 3.</p>
      <p>The first plan uses the DS2 operator to read the ℎ
column and filter it to produce positions with the corresponding
values. Next, DS4 fetches values from the  column using
the position, applies the predicate and constructs records that
satisfy it. The second plan does not use positions at all, it simply
reads both columns synchronously and constructs the result.
Thus, query executor reads each column exactly once.</p>
      <p>The paper called the first two plans early materialization, but
in fact the first one is a hybrid one since it operates both on
positions and values.</p>
      <p>The last two plans belong to the late materialization class. The
third plan at first obtains positions satisfying each predicate, then
intersects them and uses them to read the corresponding values.
In the end, they are merged to construct records. The fourth
plan filters ℎ first, and then resulting positions are used
to read  and filter using the corresponding predicate.
These refined positions are used to read values, and then a record
is constructed similarly to the previous plan. The fourth plan is
beneficial when the ℎ predicate is very selective, which
makes it possible to read a very small number of blocks from the
 column.</p>
      <p>
        The C-Store was oriented towards projection-based processing.
Projection is a pre-joined collection of columns that may involve
multiple logical tables. For this reason C-Store preferred joinless
queries, although it was possible to run joins between projections
using pre-computed join indexes. Later studies [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] revealed that
this idea was abandoned due to excessive computing costs.
      </p>
      <p>
        A subsequent paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] evaluated late materialization in joins,
but using only a single query and this query had only a single join.
One of the reasons was the use of an unsuitable data structure
(multi-column) to store post-join intermediates which restricted
further experiments.
2.6
      </p>
    </sec>
    <sec id="sec-11">
      <title>Late materialization in MonetDB family</title>
      <p>
        The MonetDB [
        <xref ref-type="bibr" rid="ref10 ref21">10, 21</xref>
        ] system ofers the largest number of
options of materialization processing. Its idea is to operate on BATs
(Binary Association Tables) — special tables that consist of one or
two columns. Each column may contain either data itself (values)
or positions. All MonetDB operators take BATs as input and also
output BATs, thus forming an algebra.
      </p>
      <p>
        In this query processing model, late materialization naturally
appears both in selections and in joins [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:
• After a selection, the result is a BAT containing positions.
• Join processing also results in BATs that contain positions.
      </p>
      <p>
        Moreover, MonetDB/X100 [
        <xref ref-type="bibr" rid="ref19 ref9">9, 19</xref>
        ] ofered another late
materialization option — selection vectors. The idea is to select a subset
of data using a vector of positions inside some operator, without
a dedicated filtration phase.
      </p>
      <p>
        However, MonetDB is an in-memory system and its query
processing model is defined by this fact. Likewise, its late
materialization techniques are dependent on having all data in memory
and thus have limited application in disk-based systems.
• They either support late materialization in joins only, but
not in selections [
        <xref ref-type="bibr" rid="ref29 ref34">29, 34</xref>
        ], or they support a rich operator
set for selections but fall behind in joins [
        <xref ref-type="bibr" rid="ref18 ref33 ref6">6, 18, 33</xref>
        ].
• They do not follow [
        <xref ref-type="bibr" rid="ref14 ref22">14, 22</xref>
        ] the now standard Volcano
processing scheme, and thus are inapplicable in contemporary
systems.
• They employ in-memory processing [
        <xref ref-type="bibr" rid="ref10 ref18 ref21">10, 18, 21</xref>
        ] and thus
are inapplicable for disk-based DBMSes. Despite the boom
of in-memory systems, there is still plenty of use-cases for
the disk-based ones.
      </p>
      <p>The summarized state of afairs in the area of late
materialization is presented in Table 1. There, we describe how well late
T1
row1
row2
row3
T1
1
2
3</p>
      <p>Sort
local access
remote access
materialization is elaborated inside selections and joins, and also
system type (main memory or disk-based).</p>
      <p>Therefore, a new disk-oriented late materialization model is
needed, which will:
(1) support C-Store-style late materialization in selections,
(2) support late materialization inside joins and will allow
handling arbitrary number of joins.</p>
      <p>In this paper, we propose such an approach which we call
ultra-late materialization. We implement it inside PosDB — a
distributed disk-based column-store engine.</p>
      <p>We run an extensive experimental study featuring several
existing materialization approaches. The goal is to compare them
with each other and study the resulting performance.</p>
      <p>
        Finally, it is important to discuss three other “general” reasons
for the re-evaluation of late materialization approaches:
• More than ten years passed since the last experiments
with late materialization. Except a single paper [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], these
experiments considered HDDs, state-of-the-art hardware
of that time. Nowadays, SSDs are ubiquitous, and their
characteristics, in particular lower random access time,
may seriously improve the performance of late
materialization inside joins. This may help overcome the out-of-order
probing problem, which arises during late materialization
inside joins. Modern SSDs have significantly reduced costs
of disk seeks, which may provide eficient reading of
values by positions even when position lists are not fully
sorted.
• There is a resurgence of interest in late
materializationenabled systems, driven by the need of supporting
provenance in systems for visual analytics [
        <xref ref-type="bibr" rid="ref27 ref28 ref37">27, 28, 37</xref>
        ]. A typical
use-case is as follows: a user runs a query, visualizes it as
a bar chart, and then wants to “zoom in” — that is,
perform additional actions with a subset of visualized data
which they select with a mouse. In such scenarios, the
query executor needs to obtain records that contributed
to the selected buckets and thus, position-oriented query
processing is useful [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Therefore, an eficient late
materialization model may improve performance of such
applications.
• Position-enabled processing will be extremely useful for
implementing functional dependency predicates inside
queries [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A number of promising operators is described
in a draft [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. During their implementation, position-enabled
query processing will play a vital role, since a lot of
dependency discovery algorithms rely on partitions [
        <xref ref-type="bibr" rid="ref26 ref7">7, 26</xref>
        ],
which are essentially positions.
3
      </p>
    </sec>
    <sec id="sec-12">
      <title>SYSTEM ARCHITECTURE</title>
      <p>
        The PosDB [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] project started in 2017 as a research efort to
study late materialization-oriented query processing. We focused
on query processing issues not studied in PosDB’s predecessors,
namely: LM and processing of aggregation queries (including
window function processing), distributed processing in LM
environments, subqueries and LM, etc. We were also interested in
solving the problem of disk-spilling joins given the advancement
of hardware (mostly SSDs) that happened since the time of the
pioneers. In short, PosDB is a distributed and parallel column-store
oriented at analytical processing in a disk-based environment.
      </p>
      <p>
        As of 2017, query processing in PosDB was organized
according to the pull-based Volcano model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] with block-oriented
processing. This model assumes that query plans are represented
via trees that have operators as nodes and data flows as edges.
Each operator supports an iterator interface and can produce
either positions or tuples exclusively.
      </p>
      <p>
        To represent intermediate positional data, PosDB uses a
generalized join index [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]: a data structure that essentially encodes
the result of a series of joins. The join index states that row 1 of
table 1 was joined with row 2 of table 2 and so on. An example
is presented in Figure 4a. Past the materialization point, PosDB
represents data in tuples, similarly to row-stores.
      </p>
      <p>The query plans in classic PosDB contain two parts: positional
(columnar) data and tuples. Operators that manipulate positions
(joins, filters, positional set operators) use join indexes to
represent intermediate results, and the tuple part (aggregation
and sort operators) is similar to row-stores. An example plan
is presented in Figure 4b, where the dashed line denotes the
materialization point (or the Tuples-Columns border).</p>
      <p>Several operators such as the positional set AND and OR need
only positional data, while others, i.e. join, filter, and
aggregation also require the corresponding values. To fetch values
using positions from the join index, we have introduced special
entities named readers. For example, ColumnReader is designed
to retrieve values of a specific attribute, and SyncReader is
designed to read values of several attributes synchronously.</p>
      <p>
        PosDB is not only distributed, but also parallel: it includes the
Asynchronizer and UnionAll operators. The former allows to
start executing an operator subtree in a separate thread, similarly
to the Exchange operator [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The latter enables the collection
of data from several subtrees, all executed in their own threads,
efectively providing data parallelism inside query plans.
      </p>
      <p>
        A detailed description of the baseline architecture can be found
in paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Since 2017, we have been exploring various
aspects of LM-oriented processing: aggregation [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], window
functions [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], compression [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], intermediate result caching [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
distributed processing [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Recently, the engine has been significantly extended. First of
all, we have improved the disk-based capabilities by introducing
a proper bufer manager and reworking access methods. This
moved PosDB closer to industrial systems and allowed us to
obtain higher-quality experimental data.</p>
      <p>
        Next, we have extended the support of partitioning by adding
hash- and range-based partitioning methods. We have also switched
from a column-based partitioning scheme to a row-based one
due to the former’s excessive complexity of maintenance. On top
of that, we have added compression support.
Query processing faced the most extensive modifications. First
of all, truly distributed operators like distributed join [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and
distributed aggregation [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] have been implemented. Next,
we have implemented a number of novel tuple-based operators
that significantly extended the number of allowed query plan
types.
      </p>
      <p>For the current study, the most important addition was the
introduction of a tuple-based join, filter, cross-product, and
aggregation operators. This allowed to greatly expand the
number of admissible query plans and made possible an
implementation of a several late materialization strategies. Unlike PosDB’17,
now there may be multiple materialization points inside the query
plan, as shown in Figure 5. In this figure, the top join and the
filter are tuple-based operators. Thus, it is now possible to
construct query plans that may have tuple data representation
inside subplans. This may be beneficial in cases when running
the whole query using late materialization is suboptimal.</p>
      <p>Finally, we have laid foundations for hybrid materialization
(HM) by introducing a special intermediate representation. For
now, it is supported only in single-tabled parts of a plan.</p>
    </sec>
    <sec id="sec-13">
      <title>QUERY EVALUATION STRATEGIES</title>
      <p>From the query processing standpoint, the modifications that
PosDB has undergone over the years have allowed us to
implement several diferent approaches to materialization:
(1) early materialization,
(2) late materialization,
(3) hybrid materialization,
(4) ultra-late materialization.</p>
      <p>
        The first two are the classic approaches, described in the early
10’s [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They will be used as the baselines. Hybrid
materialization is our recent development, aimed specifically at solving a
particular issue that arises in the distributed case during
positionbased query processing. We do not include it in the benchmark
since it does not fit into the evaluation scenario. Nevertheless, we
believe it is important to mention it here, since it demonstrates an
important issue that will inevitably arise in any position-based
query processing model run in a distributed environment. Finally,
ultra-late materialization is the core approach discussed in this
paper.
      </p>
      <p>
        Consider, for example, query Q.11 of the Star Schema
Benchmark [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
      </p>
      <sec id="sec-13-1">
        <title>SELECT</title>
        <p>SUM( l o _ e x t e n d e d p r i c e ∗ l o _ d i s c o u n t ) AS r e v e n u e</p>
      </sec>
      <sec id="sec-13-2">
        <title>FROM</title>
        <p>l i n e o r d e r , d a t e</p>
      </sec>
      <sec id="sec-13-3">
        <title>WHERE</title>
        <p>l o _ o r d e r d a t e = d _ d a t e k e y AND
l o _ d i s c o u n t BETWEEN 1 AND 3
d _ y e a r = 1 9 9 3 AND</p>
        <p>AND l o _ q u a n t i t y &lt; 2 5 ;</p>
        <p>This query has a join, three predicates, and an aggregation.
Query processing model available in PosDB allows to produce
several diferent query plans for each of materialization strategies.
Let us consider them.
4.1</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>Ultra-Late Materialization</title>
      <p>Operator</p>
      <p>Reader</p>
      <p>As stated in Section 2.8 the idea of the ultra-late materialization
is to combine late materialization in selections and joins. This
makes possible operating on positions and thus deferring tuple
reconstruction for as long as possible. Therefore, such plans
should perform all filters and joins below the materialization
point, thus using positional representation (join indexes).</p>
      <p>The ultra-late materialization plan for the example query is
presented in Fig. 6. At the first step, two position lists for the
LINEORDER table are obtained by applying predicates to each
of the involved columns. This is achieved via using Filter
operators with readers corresponding to the involved columns.</p>
      <p>Next, the PosAND operator is used to intersect position lists in
order to obtain IDs of LINEORDER records that conform to both
predicates from the query.</p>
      <p>The DATE table is processed in the same way, but using only
a single Filter operator which works with the d_year column.</p>
      <p>The Join operator is located higher up in the query plan.
Note that this is a positional join operator, i.e. it operates on
join indexes. In this particular case, they are one-dimensional, i.e.
plain position lists.</p>
      <p>Despite being positional, this join still requires the data values
themselves. Therefore, it needs readers for the corresponding
columns which can be seen in the figure.</p>
      <p>This operator produces a list of position pairs, each
corresponding to LINEORDER and DATE respectively. The list is fed
into the aggregation operator which also materializes the final
result.
4.2</p>
    </sec>
    <sec id="sec-15">
      <title>Early Materialization</title>
      <p>
        Despite being well-known [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], early materialization is a query
evaluation strategy that has been implemented in PosDB since
only recently. The idea of this strategy is to imitate a classic
rowstore: at first, a tuple representation is built using only the
necessary attributes and then all required operators are run. Therefore,
iflters, joins and other operators are performed on the tuple-based
representation.
      </p>
      <p>Fig. 7 contains the EM plan for the considered query. It
possesses several distinctive features:
• Tuple materialization happens at the bottom of the plan,
i.e. materialization operators are essentially the first
nontrivial ones.
• In a EM strategy, readers are present only in
materialization operators — all other operators are tuple-based. The
Tuples-Columns border resides on the lowest level among
all considered strategies.</p>
      <p>Finally, note that in this strategy only attributes necessary for
the further query execution are materialized, i.e. not every one
of the corresponding tables.
4.3</p>
    </sec>
    <sec id="sec-16">
      <title>Late Materialization</title>
      <p>Another baseline approach was the C-Store-style late
materialization in selections. As described in Section 2.5, its core idea
concerns operating on positions in the bottom parts of the plan
only. That is, predicates are evaluated using join index
representation, but costlier operators such as joins are evaluated using
the classic tuple-based representation.</p>
      <p>The LM plan of Q.11 is presented in Fig. 8. It can be seen that
in this query evaluation strategy, predicates are evaluated in
the positional part of the plan. Similarly to the ultra-late
materialization plan, individual position lists for LINEORDER table
are intersected. However, an operator of explicit materialization
is run next. It transforms the join-index representation into a
tuple-based one. Note that, similarly to the EM approach, this
tuple representation contains only those attributes that would
be requested by subsequent plan operators.</p>
      <p>Unlike the EM strategy, which materialized four attributes,
only three attributes will be materialized for the LINEORDER
table: lo_orderdate, lo_discount, and lo_extendedprice. The
lo_quantity attribute will not be materialized, since LINEORDER has
already been filtered by the corresponding predicate and
subsequent plan operators do not need this attribute.</p>
      <p>Concerning the DATE table, only the d_datekey attribute will
be materialized. Therefore, higher plan levels will be reached
only by two attributes of limited use: lo_orderdate and d_datekey.
Both will be discarded immediately after the join operator since
they do not belong to the answer. At the same time, they should
be materialized, since they are needed to perform this join.</p>
      <p>Finally, one can see that tuple-based part of the plan contains
aggregation and join operators. Both of them are tuple-based
and therefore contain no readers.
4.4</p>
    </sec>
    <sec id="sec-17">
      <title>Hybrid Materialization</title>
      <p>
        Intensive data reaccess within a query plan is an inherent
characteristic of the late materialization approach. The existing
papers usually consider the local case [
        <xref ref-type="bibr" rid="ref3 ref30 ref6">3, 6, 30</xref>
        ], where the induced
overhead can be reduced by an eficient bufer manager. Then,
the advantages of late materialization (compression [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
cacheconsciousness [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) outweigh its drawbacks.
      </p>
      <p>However, in a distributed case, the overhead is significantly
higher, because data has to be reaccessed via network. At the
same time, a network cache is usually less robust than a bufer
manager. Moreover, the common query processing workflow
may require meaningless round-trips of position blocks between
nodes.</p>
      <p>Join(A,B)
request pos
Receive
request pos</p>
      <p>Send
request pos</p>
      <p>Filter(A)</p>
      <p>data
pos
B</p>
      <sec id="sec-17-1">
        <title>Node1</title>
      </sec>
      <sec id="sec-17-2">
        <title>Node2</title>
        <p>A</p>
        <p>Network
Reader
pos</p>
        <p>data
Send
pos data
Consider an execution of a Filter-Join operator sequence in
PosDB using tables A and B, shown in Figure 9a. This operator
sequence is commonly found in query plans. It also has the same
workflow as a necessary step of the reshufling process during a
distributed join.</p>
        <p>With the late materialization approach, the Filter operator
processes table A and passes the resulting position blocks over
the network to  1. Then the Join operator has to return the
same blocks to  2 with a request for table A data. Thus, a
meaningless network communication cycle exists.</p>
        <p>Moreover, in the general case, the Join operator can process
position blocks from a set of Filter operators executed both
locally and on diferent remote nodes. Table A data may be
requested by the next query plan operator again, immediately
after the Join. Therefore, a specific optimization for a particular
“Filter-Join” case is not suficient. Instead, the whole query
execution model should be improved.</p>
        <p>PosDB solves the problem with two major changes: hybrid
materialization support and moving out the data acquisition logic
from operator internals into a separate optimizer-configurable
entity.</p>
        <p>
          First, hybrid materialization [
          <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
          ] allows to pass both data
and positions between operators. Thus, the required table A
data is materialized immediately after the Filter operator and
sent to Join together with position blocks, see Figure 9b. The
meaningless position round-trip is eliminated, and the network
communication overhead decreases nearly twice.
        </p>
        <p>
          In existing articles, hybrid materialization is treated as an
one-way step from purely position-based to tuple-based
representation [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, the distributed case, which we have just
described, establishes a novel, completely diferent context. And
in this context, reverting to late materialization just after the
Join operator may be preferable.
        </p>
        <p>Another major change is the encapsulation of the data
acquisition process in an independent separate entity. This allows to
combine both late and hybrid materialization when accessing
several replicas located both locally and remotely. Moreover, it
can be done seamlessly to an operator itself due to the unified
interface and C++ template techniques.</p>
        <p>Encapsulating the data acquisition strategy also increases
opportunities for query plan tuning. A query optimizer can freely
choose not only the particular nodes the data is received from,
but also the order and the intensity of their usage. It can be
especially important in the case of a distributed join. The dynamic
reshufling stage may consider several data streams from remote
Multiplexors as well as local data from disk.</p>
        <p>1400
1200
1000
s
ond 800
c
e
s
,e 600
m
i
T
400
200
0</p>
        <p>Ultra Late Materialization (PosDB)
Late Materialization (PosDB)
Early Materialization (PosDB)
PostgreSQL</p>
        <p>MariaDB Column Store
0
20
80</p>
        <p>100
40</p>
        <p>60</p>
        <p>Scale Factor</p>
      </sec>
    </sec>
    <sec id="sec-18">
      <title>EXPERIMENTAL EVALUATION</title>
      <p>
        To study the considered strategies, we have performed a twofold
experimental evaluation. First, we have compared them to each
other within PosDB. Second, we have selected the two most
similar industrial competitors: a disk-based row-store (PostgreSQL)
and a column-store (MariaDB Column-Store). We centered our
baseline around open-source-licensed systems due to the DeWitt
Clause [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which they are free of. At the same time, they are
mature enough to be used in industrial applications.
      </p>
      <p>Since our evaluation was performed in a centralized
environment, we did not consider the HM strategy.</p>
      <p>
        In order to run our experiments, we have used the Star Schema
Benchmark [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], varying the Scale Factor (SF) in a [
        <xref ref-type="bibr" rid="ref1">1, 100</xref>
        ] range.
Thus, the total data volume ranged from 0, 8 to 80 GB. We have
selected SSB because of its focus on engine comparison instead
of optimizer output. This is achieved by the relative simplicity of
queries: they do not allow much variability in terms of admissible
query plans.
      </p>
      <p>In order to provide an equal environment for all systems, the
following was done during the test runs:
• Data compression, JIT-compilation, SIMD, and indexes
were not used.
• DBMSes were not tuned, default parameters were used.
• Default data plans were used.</p>
      <p>Early Materialization (PosDB)
Late Materialization (PosDB)
Ultra Late Materialization (PosDB)
PostgreSQL
MariaDB</p>
      <p>Early Materialization (PosDB)
Late Materialization (PosDB)
Ultra Late Materialization (PosDB)
PostgreSQL
MariaDB
Q1.1 Q1.2 Q1.3 Q2.1 Q2.2 Q2.3 Q3.1 Q3.2 Q3.3 Q3.4 Q4.1 Q4.2 Q4.3</p>
      <p>Query
(a) Scale Factor 40
(b) Scale Factor 80
• Hash-based versions of joins were used: the smaller table
was put into a hash table.
• Intra- and inter- query parallelism was turned of.
Distributed capabilities were not used.</p>
      <p>All systems were run without warm-up, OS page cache was
dropped before launching each experiment run. Since each run
yielded almost the same numbers, we present results averaged
over ten launches. Experiments were run on a desktop with
the following hardware: AMD Ryzen 9 3900X, GIGABYTE X570
AORUS ELITE, Kingston HyperX FURY Black HX434C16FB3K2/32
32GB, 512 GB SSD M.2 Patriot Viper VPN100-512GM28H.</p>
      <p>Software: Ubuntu 20.04 LTS, GCC 9.3.0, PostgreSQL 12.5,
MariaDB Column-Store 1.5.2 on MariaDB Community Server 10.5.8.</p>
      <p>First of all, we have measured the total runtime on the whole
SSB workload (Fig 10). Here, it can be seen that PosDB’s ultra-late
materialization strategy performs the best with all SF values. The
second best, MariaDB Column Store, is almost 50% slower. The
next three, LM, EM, and PostgreSQL, show more interesting
behavior. In the [0, 40] range, all three of them have approximately
the same run times. However, starting from SF 40 we can see the
following:
(1) Late Materialization is consistently ≈ 5% faster than Early</p>
      <p>Materialization.
(2) PostgreSQL loses about 15% to both Early and Late
Materialization. This happens due to PostgreSQL running out
of memory to cache pages in its bufer manager.
(3) MariaDB Column Store beats PostgreSQL by more than
two times.</p>
      <p>The second experiment provides an in-depth view into the
results by studying the performance of individual queries. We have
selected two important points: SF 40 and SF 80. The first one is
the point where all tuple-oriented strategies show approximately
equal performance, and the second one is the point where the
graphs fully diverge.</p>
      <p>The results are presented in Fig. 11:
• For almost all queries, PosDB’s ultra-late materialization
yields the best result. The only exception are the queries
of the first flight, on which MariaDB Column Store shows
very close or better results.
• Fig. 11b shows that increasing the data size leads to
PostgreSQL losing on the first three query flights out of total
four.
• It is interesting to note that PosDB’s LM and EM strategies
are in-between in terms of performance on all but the
fourth flight. This query flight is the most complex one,
featuring four joins.
6</p>
    </sec>
    <sec id="sec-19">
      <title>FUTURE PLANS</title>
      <p>
        (1) The critical issue for adoption of late-materialization in
the 00’s were disk-spilling joins. LM-induced out-of-order
probing [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ] killed the performance on HDDs and even
SSDs of that time. Nowadays, hardware and software
specifications have improved, and novel storage such as NVMe
has appeared. All this calls for a second round of
materialization evaluation that will employ joins with tables
larger than main memory.
(2) We plan to further explore Hybrid Materialization not
only because it looks promising for addressing the
diskspilling join problem, but it also ofers query plans that
may improve query performance further.
(3) Extending the system itself: implementing a parser, a query
optimizer, supporting subqueries, indexes, implementing
FD-manipulation predicates [
        <xref ref-type="bibr" rid="ref12 ref8">8, 12</xref>
        ], and so on.
      </p>
    </sec>
    <sec id="sec-20">
      <title>7 CONCLUSION</title>
      <p>In this paper, we reviewed existing approaches to implementing
late materialization in order to compare and classify them. Then
we demonstrated that there are two types of late materialization —
one concerning selections and one concerning joins. We showed
that there were no studies on how to eficiently combine these
approaches for disk-based systems. Following this, we proposed
the ultra-late materialization which makes it possible to
implement both of these approaches inside a single query processing
model.</p>
      <p>Further, we have presented a new version of PosDB.
Modifications that it underwent over the years allowed us to greatly
expand its query processing model. Now it ofers substantial
flexibility in terms of possible query plans and allows to implement
diferent types of late materialization strategies. We have
implemented the proposed ultra-late materialization as well as several
other materialization approaches. Following the description of
basic system architecture and recent modifications, we have
discussed these strategies, using an SSB query as an example.</p>
      <p>Next, in order to study materialization strategies we have
performed experimental evaluation. We have compared them
not only with each other, but with industrial-grade systems as
well — PostgreSQL and MariaDB Column Store. Our experiments
demonstrated that our ultra-late materialization provides 50%
better performance than MariaDB Column Store, and is more
than three times faster than PostgreSQL.</p>
    </sec>
    <sec id="sec-21">
      <title>ACKNOWLEDGMENTS</title>
      <p>We would like to thank Anna Smirnova for her help with the
preparation of the paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] [n.d.].
          <source>TPC Benchmark H. Decision Support. Version</source>
          <volume>2</volume>
          .17.1. http://www.tpc. org/tpch.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>2009. P. E. O'Neil</surname>
            ,
            <given-names>E. J. O</given-names>
          </string-name>
          <string-name>
            <surname>'Neil</surname>
            and
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The Star Schema Benchmark (SSB)</article-title>
          . http://www.cs.umb.edu/~poneil/StarSchemaB.PDF. Accessed:
          <volume>10</volume>
          /09/
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Boncz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stavros</given-names>
            <surname>Harizopoulos</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The Design and Implementation of Modern Column-Oriented Database Systems</article-title>
          . Now Publishers Inc., Hanover, MA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Abadi</surname>
          </string-name>
          , Samuel Madden, and
          <string-name>
            <given-names>Miguel</given-names>
            <surname>Ferreira</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Integrating Compression and Execution in Column-oriented Database Systems</article-title>
          .
          <source>In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06)</source>
          . ACM, New York, NY, USA,
          <fpage>671</fpage>
          -
          <lpage>682</lpage>
          . https://doi.org/10.1145/ 1142473.1142548
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Daniel</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Peter A. Boncz</surname>
            , and
            <given-names>Stavros</given-names>
          </string-name>
          <string-name>
            <surname>Harizopoulos</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Columnoriented Database Systems</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>2</volume>
          ,
          <issue>2</issue>
          (Aug.
          <year>2009</year>
          ),
          <fpage>1664</fpage>
          -
          <lpage>1665</lpage>
          . https://doi.org/10.14778/1687553.1687625
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Daniel</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Abadi</surname>
            , Daniel S. Myers,
            <given-names>David J.</given-names>
          </string-name>
          <string-name>
            <surname>DeWitt</surname>
            , and
            <given-names>Samuel</given-names>
          </string-name>
          <string-name>
            <surname>Madden</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Materialization Strategies in a Column-Oriented DBMS</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>ICDE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rada</surname>
            <given-names>Chirkova</given-names>
          </string-name>
          , Asuman Dogac,
          <string-name>
            <given-names>M. Tamer</given-names>
            <surname>Özsu</surname>
          </string-name>
          , and
          <string-name>
            <surname>Timos</surname>
            <given-names>K.</given-names>
          </string-name>
          Sellis (Eds.). IEEE,
          <fpage>466</fpage>
          -
          <lpage>475</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tobias</given-names>
            <surname>Bleifuß</surname>
          </string-name>
          , Susanne Bülow, Johannes Frohnhofen, Julian Risch, Georg Wiese,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Kruse</surname>
          </string-name>
          , Thorsten Papenbrock, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Approximate Discovery of Functional Dependencies for Large Datasets</article-title>
          .
          <source>In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM '16)</source>
          .
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <fpage>1803</fpage>
          -
          <lpage>1812</lpage>
          . https://doi.org/10.1145/2983323.2983781
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Nikita</given-names>
            <surname>Bobrov</surname>
          </string-name>
          , Kirill Smirnov, and
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chernishev</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Extending Databases to Support Data Manipulation with Functional Dependencies: a Vision Paper</article-title>
          . CoRR abs/
          <year>2005</year>
          .07992 (
          <year>2020</year>
          ). arXiv:
          <year>2005</year>
          .07992 https://arxiv.org/ abs/
          <year>2005</year>
          .07992
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Boncz</surname>
          </string-name>
          , Marcin Zukowski, and
          <string-name>
            <given-names>Niels</given-names>
            <surname>Nes</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>MonetDB/X100: Hyperpipelining query execution</article-title>
          . In In CIDR'
          <volume>05</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Peter</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Boncz and Martin L. Kersten</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>MIL Primitives for Querying a Fragmented World</article-title>
          .
          <source>The VLDB Journal 8</source>
          ,
          <issue>2</issue>
          (Oct.
          <year>1999</year>
          ),
          <fpage>101</fpage>
          -
          <lpage>119</lpage>
          . https: //doi.org/10.1007/s007780050076
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>George</surname>
            <given-names>Chernishev</given-names>
          </string-name>
          , Viacheslav Galaktionov, Valentin Grigorev, Evgeniy Klyuchikov, and
          <string-name>
            <given-names>Kirill</given-names>
            <surname>Smirnov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Study of PosDB Performance in a Distributed Environment</article-title>
          .
          <source>In Proceedings of the 2017 Software Engineering and Information Management (CEUR Workshop Proceedings)</source>
          , Vol.
          <year>1864</year>
          . CEURWS.org, Saint-Petersburg, Russia.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chernishev</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Making DBMSes Dependency-Aware</article-title>
          .
          <source>In 10th Conference on Innovative Data Systems Research, CIDR</source>
          <year>2020</year>
          , Amsterdam, The Netherlands,
          <source>January 12-15</source>
          ,
          <year>2020</year>
          ,
          <string-name>
            <given-names>Online</given-names>
            <surname>Proceedings</surname>
          </string-name>
          . www.cidrdb.org. http://cidrdb. org/cidr2020/gongshow2020/gongshow/abstracts/cidr2020_abstract67.pdf
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Chernishev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Galaktionov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Grigorev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Klyuchikov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Smirnov</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>PosDB: An Architecture Overview</article-title>
          .
          <source>Programming and Computer Software</source>
          <volume>44</volume>
          ,
          <issue>1</issue>
          (Jan.
          <year>2018</year>
          ),
          <fpage>62</fpage>
          -
          <lpage>74</lpage>
          . https://doi.org/10.1134/S0361768818010024
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>George</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Copeland</surname>
            and
            <given-names>Setrag N.</given-names>
          </string-name>
          <string-name>
            <surname>Khoshafian</surname>
          </string-name>
          .
          <year>1985</year>
          .
          <article-title>A decomposition storage model</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>14</volume>
          ,
          <issue>4</issue>
          (
          <year>1985</year>
          ),
          <fpage>268</fpage>
          -
          <lpage>279</lpage>
          . https://doi.org/10.1145/971699. 318923
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Viacheslav</surname>
            <given-names>Galaktionov</given-names>
          </string-name>
          , Evgeniy Klyuchikov, and
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chernishev</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Position Caching in a Column-Store with Late Materialization: An Initial Study</article-title>
          .
          <source>In Proceedings of DOLAP@EDBT/ICDT 2020 (CEUR Workshop Proceedings)</source>
          ,
          <source>IlYeol Song, Katja Hose, and Oscar Romero (Eds.)</source>
          , Vol.
          <volume>2572</volume>
          . CEUR-WS.org,
          <volume>89</volume>
          -
          <fpage>93</fpage>
          . http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2572</volume>
          /short14.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Goetz</given-names>
            <surname>Graefe</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>Query Evaluation Techniques for Large Databases</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>25</volume>
          ,
          <issue>2</issue>
          (
          <year>June 1993</year>
          ),
          <fpage>73</fpage>
          -
          <lpage>169</lpage>
          . https://doi.org/10.1145/152610.152611
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Volcano - An Extensible and Parallel Query Evaluation System</article-title>
          .
          <source>IEEE Trans. on Knowl. and Data Eng. 6</source>
          ,
          <issue>1</issue>
          (Feb.
          <year>1994</year>
          ),
          <fpage>120</fpage>
          -
          <lpage>135</lpage>
          . https: //doi.org/10.1109/69.273032
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Martin</surname>
            <given-names>Grund</given-names>
          </string-name>
          , Jens Krueger, Matthias Kleine, Alexander Zeier, and
          <string-name>
            <given-names>Hasso</given-names>
            <surname>Plattner</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Optimal query operator materialization strategy for hybrid databases</article-title>
          .
          <source>In Proceedings of the 2011 Third International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA '11)</source>
          . IARIA,
          <fpage>169</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Stavros</surname>
            <given-names>Harizopoulos</given-names>
          </string-name>
          , Daniel Abadi, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Column-Oriented Database Systems</article-title>
          , VLDB 2009 Tutorial. nms.csail.mit.edu/~stavros/pubs/ tutorial2009-column_stores.pdf
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Joseph</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hellerstein</surname>
            and
            <given-names>Michael</given-names>
          </string-name>
          <string-name>
            <surname>Stonebraker</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Readings in Database Systems</article-title>
          . The MIT Press, pp.
          <fpage>96</fpage>
          -
          <lpage>.</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Stratos</surname>
            <given-names>Idreos</given-names>
          </string-name>
          , Fabian Grofen, Niels Nes, Stefan Manegold,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sjoerd Mullender</surname>
          </string-name>
          , and
          <string-name>
            <surname>Martin</surname>
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kersten</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng</article-title>
          .
          <source>Bull. 35</source>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>40</fpage>
          -
          <lpage>45</lpage>
          . http://sites.computer.org/debull/A12mar/monetdb.pdf
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Setrag</surname>
            <given-names>Khoshafian</given-names>
          </string-name>
          , George P. Copeland, Thomas Jagodis, Haran Boral, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Valduriez</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>A Query Processing Strategy for the Decomposed Storage Model</article-title>
          .
          <source>In Proceedings of the Third International Conference on Data Engineering</source>
          . IEEE Computer Society, Washington, DC, USA,
          <fpage>636</fpage>
          -
          <lpage>643</lpage>
          . http: //dl.acm.org/citation.cfm?id=
          <volume>645472</volume>
          .
          <fpage>655555</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Donald</given-names>
            <surname>Kossmann</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>The State of the Art in Distributed Query Processing</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>32</volume>
          ,
          <issue>4</issue>
          (Dec.
          <year>2000</year>
          ),
          <fpage>422</fpage>
          -
          <lpage>469</lpage>
          . https://doi.org/10.1145/371578. 371598
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lamb</surname>
          </string-name>
          , Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and
          <string-name>
            <given-names>Chuck</given-names>
            <surname>Bear</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>The Vertica Analytic Database: C-Store 7 Years Later</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>5</volume>
          ,
          <issue>12</issue>
          (Aug.
          <year>2012</year>
          ),
          <fpage>1790</fpage>
          -
          <lpage>1801</lpage>
          . https://doi.org/ 10.14778/2367502.2367518
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Nadezhda</surname>
            <given-names>Mukhaleva</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Valentin D.</given-names>
            <surname>Grigorev</surname>
          </string-name>
          , and
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chernishev</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Implementing Window Functions in a Column-Store with Late Materialization</article-title>
          .
          <source>In Model and Data Engineering - 9th International Conference, MEDI 2019</source>
          , Toulouse, France,
          <source>October 28-31</source>
          ,
          <year>2019</year>
          ,
          <source>Proceedings (Lecture Notes in Computer Science)</source>
          ,
          <source>Klaus-Dieter Schewe and Neeraj Kumar Singh (Eds.)</source>
          , Vol.
          <volume>11815</volume>
          . Springer,
          <fpage>303</fpage>
          -
          <lpage>313</lpage>
          . https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -32065-2_
          <fpage>21</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Thorsten</surname>
            <given-names>Papenbrock</given-names>
          </string-name>
          , Jens Ehrlich, Jannik Marten, Tommy Neubert, JanPeer Rudolph, Martin Schönberg, Jakob Zwiener, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>8</volume>
          ,
          <issue>10</issue>
          (jun
          <year>2015</year>
          ),
          <fpage>1082</fpage>
          -
          <lpage>1093</lpage>
          . https://doi.org/ 10.14778/2794367.2794377
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Fotis</given-names>
            <surname>Psallidas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications</article-title>
          .
          <source>In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference</source>
          <year>2018</year>
          , Houston, TX, USA, June 10-15,
          <year>2018</year>
          ,
          <string-name>
            <given-names>Gautam</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Christopher M. Jermaine</surname>
          </string-name>
          , and Philip A. Bernstein (Eds.). ACM,
          <volume>1781</volume>
          -
          <fpage>1784</fpage>
          . https://doi.org/10.1145/3183713.3193537
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Fotis</given-names>
            <surname>Psallidas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Smoke: Fine-grained Lineage at Interactive Speed</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          ,
          <issue>6</issue>
          (
          <year>2018</year>
          ),
          <fpage>719</fpage>
          -
          <lpage>732</lpage>
          . https://doi.org/10.14778/ 3184470.3184475
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Ravishankar</surname>
            <given-names>Ramamurthy</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>David J.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Qi</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>A case for fractured mirrors</article-title>
          .
          <source>In Proceedings of the 28th international conference on Very Large Data Bases (VLDB '02)</source>
          .
          <source>VLDB Endowment</source>
          ,
          <volume>430</volume>
          -
          <fpage>441</fpage>
          . http://dl.acm.org/ citation.cfm?id=
          <volume>1287369</volume>
          .
          <fpage>1287407</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>Shrinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bodagala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Varadarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bharathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bear</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Materialization Strategies in the Vertica Analytic Database: Lessons Learned</article-title>
          .
          <source>In 2013 IEEE 29th International Conference on Data Engineering (ICDE)</source>
          .
          <volume>1196</volume>
          -
          <fpage>1207</fpage>
          . https://doi.org/10.1109/ICDE.
          <year>2013</year>
          .6544909
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Alexander</surname>
            <given-names>Slesarev</given-names>
          </string-name>
          , Evgeniy Klyuchikov, Kirill Smirnov, and
          <string-name>
            <surname>George</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chernishev</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Revisiting Data Compression in Column-Stores</article-title>
          .
          <source>In Model and Data Engineering - 10th International Conference, MEDI 2021</source>
          , Tallinn, Estonia, June 21-23,
          <year>2021</year>
          , Proceedings (Lecture Notes in Computer Science),
          <source>J. Christian Attiogbé and Sadok Ben Yahia (Eds.)</source>
          , Vol.
          <volume>12732</volume>
          . Springer,
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          . https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -78428-7_
          <fpage>22</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Cluet</given-names>
            <surname>Sophie</surname>
          </string-name>
          and
          <string-name>
            <given-names>Moerkotte</given-names>
            <surname>Guido</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Eficient Evaluation of Aggregates on Bulk Types</article-title>
          .
          <source>In Proceedings of the Fifth International Workshop on Database Programming Languages. 8.</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Mike</surname>
            <given-names>Stonebraker</given-names>
          </string-name>
          , Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden,
          <string-name>
            <surname>Elizabeth O'Neil</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pat O'Neil</surname>
            , Alex Rasin, Nga Tran, and
            <given-names>Stan</given-names>
          </string-name>
          <string-name>
            <surname>Zdonik</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Cstore: A Column-oriented DBMS</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05)</source>
          .
          <source>VLDB Endowment</source>
          ,
          <volume>553</volume>
          -
          <fpage>564</fpage>
          . http://dl.acm.org/citation.cfm?id=
          <volume>1083592</volume>
          .
          <fpage>1083658</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Dimitris</surname>
            <given-names>Tsirogiannis</given-names>
          </string-name>
          , Stavros Harizopoulos, Mehul A.
          <string-name>
            <surname>Shah</surname>
          </string-name>
          , Janet L.
          <string-name>
            <surname>Wiener</surname>
            , and
            <given-names>Goetz</given-names>
          </string-name>
          <string-name>
            <surname>Graefe</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Query Processing Techniques for Solid State Drives</article-title>
          .
          <source>In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09)</source>
          . ACM, New York, NY, USA,
          <fpage>59</fpage>
          -
          <lpage>72</lpage>
          . https://doi.org/10. 1145/1559845.1559854
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuchina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Grigorev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Chernishev</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>On-the-fly filtering of aggregation results in column-stores</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <volume>2135</volume>
          (
          <year>2018</year>
          ),
          <fpage>53</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Valduriez</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Join Indices</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .
          <volume>12</volume>
          ,
          <issue>2</issue>
          (
          <year>June 1987</year>
          ),
          <fpage>218</fpage>
          -
          <lpage>246</lpage>
          . https://doi.org/10.1145/22952.22955
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Systems for Human Data Interaction (keynote)</article-title>
          .
          <source>In Proceedings of the 2nd Workshop on Search</source>
          , Exploration, and
          <article-title>Analysis in Heterogeneous Datastores (SEA-Data 2021) co-located with 47th International Conference on Very Large Data Bases (VLDB</article-title>
          <year>2021</year>
          ), Copenhagen, Denmark,
          <year>August 20</year>
          ,
          <year>2021</year>
          ,
          <string-name>
            <given-names>Davide</given-names>
            <surname>Mottin</surname>
          </string-name>
          , Matteo Lissandrini, Senjuti Basu Roy, and Yannis Velegrakis (Eds.).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>