<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptive Indexing for In-situ Visual Exploration and Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stavros Maroulis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Bikakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Papastefanatos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panos Vassiliadis</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannis Vassiliou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Big Raw Data, Visualization, Visual Analytics, Interactive and</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ATHENA Research Center</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nat. Techn. Univ. of Athens &amp;, ATHENA Research Center</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Nat. Techn. Univ. of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Progressive Indexing</institution>
          ,
          <addr-line>Categorical Attributes, RawVis</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Ioannina</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In-situ processing has received a great deal of attention in recent years. In in-situ scenarios, big raw data files which do not fit in main memory, must be efficiently handled using commodity hardware, without the overhead of a preprocessing phase or the loading of data into a database. In this work, we present an adaptive indexing scheme that enables efficient visual exploration and analytics over big raw data files. Beyond visual exploration and statistics, the scheme enables categorical-based analytics using group-by and filter operations. The proposed scheme combines a tile-based structure that offers efficient exploratory operations over the 2D space, with a tree-based structure that organizes a tile's objects based on their categorical values, enabling efficient visual analytics and the support of advanced visualization methods. The index resides in main memory and is built progressively as the user explores parts of the raw file, whereas its structure and level of granularity are adjusted to the user's exploration areas and type of analysis. We conduct experiments using real and synthetic datasets, and demonstrate that the proposed approach, is in most cases more than 40× faster compared to the existing solutions, and performs around 3 orders of magnitude less I/O operations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Commonly, in data exploration scenarios, users wish to visually
interact and analyze large data files that do not fit in main memory ,
e.g., data produced by scientific workflows, IoT devices or
crowdsourcing. These users usually have limited skills in data
management and processing as well as limited resources or commodity
hardware for use, in contrast to, e.g., a distributed environment.
Ideally, the tasks in such scenarios, require a very small raw
datato-analysis time and memory resources, as well as efficient visual
exploration and analytic operations, which are performed via
interactive visualizations, such as maps, scatter plots, histograms,
or other analytical and statistical methods, e.g., OLAP analysis,
correlation, clustering, and regression.</p>
      <p>Consider the following real-word example, which refers to a
common task in telecommunication industry. The data scientists
working in telco companies analyze network data in order to get
insights regarding the network quality of the company. Such data
are stored in comma-separated data files and contain signal
measurements crowdsourced from IoT mobile devices, e.g., connected
cars, mobile phones.1 Using this data, scientists visually explore,
analyze and produce benchmarks regarding the network quality.</p>
      <p>Figure 1(a) presents a sample of a raw file containing vfie
entries/objects (1∼5), where each of them represents a signal
measurement. Basically, each entry contains data regarding the:</p>
      <sec id="sec-1-1">
        <title>1For example, https://www.tutela.com.</title>
        <p>geographic location (Lat, Long), signal strength (Signal) and
network bandwidth (Width), as well the categorical characteristics:
brand, network provider, and network technology (Net).2</p>
        <p>Figure 1(c) outlines our working scenario. Assume that a data
scientist wishes to visually explore the network data using a 2D
visualization technique, e.g., scatter plot, map; and analyze it using
visual analytics and statistics. The user first selects the input file
and a map as the underlying visualization layout. The file is parsed
and an initial "crude" version of the index is constructed A . Then,
the user interacts and performs visual and analytic operations on
the map B .</p>
        <p>For example, the user  on the map the signal
measurements located in a specific geographic area, views  (e.g.,
provider) for the points visualized, or   out the ones that
refer to AT&amp;T C . Next, the user may  (e.g., pan left) the
visualized region in order to explore a nearby area; or -/
to explore a part of the region or a larger area, respectively. For the
visualized points, the user wishes to compute statistics between
numeric attributes, e.g., the Pearson correlation coefficient between
the signal strength and the bandwidth; or visualize its values using
a scatter plot. Finally, the user proceeds with the analysis of the
data based on one or more categorical attributes; e.g., generate:
(1) a heatmap to present the average signal strength per provider
and network technology, or (2) a bar chart to present the average
signal strength for each provider, or (3) parallel coordinates to
present the number of measures grouped by provider, brand, and
network technology D .</p>
        <p>Eventually, each user interaction is mapped to a query
evaluated over the index E and triggers the readjustment of the index
structure and the update of its contents F .</p>
        <p>In the last few years, the in-situ paradigm has been adopted in
the context of data exploration scenarios, referring to data access
methods that enable the analysis over large sets of raw data, i.e.,
data files in raw formats like CSV or JSON. In-situ techniques
attempt to avoid the overhead of fully loading and indexing the data
in a DBMS and improve performance by progressively building
an index as the user explores data.</p>
        <p>
          Recent works in this area have proposed techniques for
progressive loading and indexing of the data, for generic in-situ querying
(mainly range queries) [
          <xref ref-type="bibr" rid="ref20 ref25 ref26 ref33 ref34 ref39 ref6">6, 20, 25, 26, 33, 34, 39</xref>
          ], and for 2D
visual operations over numeric attributes [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. However, in
the context of in-situ processing, no attention has been given, to
exploratory aggregate queries, on data with categorical attributes
and specifically, group-by queries that filter and aggregate results
based on taxonomies and controlled vocabularies, which comprise
the domains of the categorical attributes.
        </p>
        <p>As demonstrated in the motivating example, the Group-by
operation is essential in order to generate the most-known visualization
types, in which categorical-based aggregated results are visualized.
Such visualization types include: bar charts, heatmaps, parallel
coordinates, (binned) scatter plots, radar chart, pies, etc. The great
2For simplicity, the numbers shown do not correspond to real values; also, Samsg
stands for Samsung and Veriz for Verizon.</p>
        <p>atL
Abrand = e,lAp{ ,iewaHu ,gasmS m}ioXa
Aprovider = ,TA&amp;{ }rziVe
Anet = ,G3{ ,G4 }G5
(b) ciloregatC esburitA saniomD
nLgo
lniaSg
esubitrA
waR
Dta
ulsiaV itsnaeoprO</p>
        <p>?
render pgsmaSan zoom filter details analysis</p>
        <p>ierzV
gsmaS ierzV</p>
        <p>B
h-Otyflne
scneiPorg &amp;
idgIxne</p>
        <p>
          A
importance of categorical-based visualization types in data
analysis, is also verified by [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] showing that bar charts are by far
the most commonly used visualization type. Beyond the group-by
operation, the Filter operation over categorical attributes, enables
the support of effective exploration mechanisms, such as faceted
search.
        </p>
        <p>In this work, we propose an innovative indexing scheme and
adaptive techniques in the context of in-situ visual exploration,
which supports efficient categorical-based group-by and filter
operations, combined with 2D visual interactions, such as exploration
of data points on maps. To the best of our knowledge, there is no
work studying categorical-based operations in in-situ scenarios.
The proposed scheme employs a tile-based structure which offers
efficient exploration over the 2D plane, with a tree-based structure
that organizes tile’s objects based on its categorical values. The
index resides in main memory and is built progressively as the
user explores parts of the raw file, whereas its structure and level
of granularity are adjusted to the user’s exploration areas and type
of analysis. In our experiments we illustrate that the proposed
method in most cases is about 40× faster and performs up to 3
orders of magnitude less I/O operations, compared to existing
solutions.</p>
        <p>Contributions. In this paper, we provide the following
contributions:
− We formulate exploratory and analytical operations over
categorical attributes (i.e., group-by, filter, and aggregate),
that are mapped to query operators and evaluated over the
underlying indexing scheme.
− We design a main-memory lightweight tree structure that
organizes objects and computes statistics based on
categorical attributes.
− We introduce a hybrid index that combines tile and tree
structures. The index organizes the objects based on two
dimensions, as well as based on the categorical attributes’
values of the objects.
− We design interaction-based adaptation techniques that
progressively adjust the index structure based on the user
interaction.
− We evaluate the performance of our method in terms of
execution time and I/O operations, using real and synthetic
datasets. We show that the proposed approach outperforms
competitors both in execution time and I/O operations.</p>
        <p>
          Compared to our previous work in the context of in-situ visual
exploration [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], this work enables the support of categorical
attributes, which are not previously considered. In this context,
we define three categorical-based operations (Sect. 2), a
treebased structure (Sect. 3) which is integrated with the tile-based
structure proposed in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (Sect. 4); and initialization (Sect. 4.3),
query processing (Sect. 5.1), and adaptation methods (Sect. 5.2)
for the introduced index and operations. The methods have been
encapsulated into the RawVis system [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] offering efficient in-situ
visual analytics.
        </p>
        <p>Outline. The paper is organized as follows. In Section 2, we
present our exploration model. In Section 3, we describe the tree
structure, and in Section 4, the proposed indexing scheme. Then,
Section 5 presents the query evaluation and the adaptation
techniques. Section 6 presents the experimental evaluation, and
Section 7 the related work. Finally, Section 8 concludes the paper.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>EXPLORATION MODEL</title>
      <p>
        In this section, we present the exploration model, which defines
a set of exploratory and analytic operations and their translation
to data-access operations. In this work, we extend the model
presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] by defining three new operations over categorical
attributes. Particularly, we define the group-by, filter, and
aggregate operations.
      </p>
      <p>Raw Data File &amp; Objects. We assume a raw data file F
containing a set of -dimensional objects O. Each dimension corresponds
to an attribute  ∈ A, where  may be spatial, numeric,
categorical, or textual. Each object  is defined as a list of  attribute
values  = (,1, ,2, ..., , ), and it is associated with an offset
 (a hex value) pointing to the “position” of its first attribute
from the beginning of the file F . Also, given an object  and an
attribute , as , we denote the  value of the object  .</p>
      <p>Let A ⊆ A denote the categorical attributes of the objects.
Each categorical attribute  is represented as a finite set of values
 = {1, 2, ... }, which defines the domain of the attribute, i.e.,
 ( ).</p>
      <p>User Interactions. The exploration model denotes a series of
user interactions which are formulated as a set of operations (e.g.,
render, zoom).</p>
      <p>Given a raw data file, the user arbitrarily selects two numeric
attributes  ,  ∈ A, that are mapped to the X and Y axis of
a 2D visualization layout (e.g., scatter plot, map). The  and
 attributes are denoted as axis attributes, while the rest as
nonaxis.3</p>
      <p>The user selects to visualize a rectangular area Φ = ( ,  ),
called visualized area, which is denfied by the two intervals
 = [1, 2] and  = [1, 2] over the axis attributes  and  ,
respectively; i.e., Φ corresponds to the 2D area  ×  . The
visualized area, contains a set of visible objects OΦ ⊆ O, for which the
values of their axis attributes fall within the ranges of that area.4</p>
      <p>In this setting the following operations/interactions are defined:
(1) render: visualizes the objects contained in the visualized area.
(2) move: changes the boundaries of the visualized area, i.e., a
3We assume that the user is familiar with the schema of the data file; otherwise, as a
ifrst step, she may have a preview of it, in terms of loading a small sample.
4In order to express the first query, we assume that the user knows the min/max
values of the axis attributes. Otherwise, these values are determined by parsing the
ifle once.
pan operation (3) zoom in/out: zooms the boundaries of the
visualized area keeping the center point inside Φ fixed. (4) filter :
excludes objects visualized in Φ, based on conditions over the
nonaxis attributes. (5) detail: presents information (e.g., attribute
values) related to the non-axis attributes. (6) group: finds group
of objects based on one or more categorical attributes; i.e., similar
to the group-by operation defined in SQL. (7) analyze: computes
aggregate or statistical functions over all objects or groups of
objects in the visualized area.</p>
      <p>These operations may be combined in a sequence, e.g., render
a region, filter the presented objects, group the objects based on
an attribute and finally compute an average value for the groups.
Exploratory Query. Considering the aforementioned user
operations, we proceed with defining them as data-access operators,
which operate on the underlying data file. Data-access operators
are essentially the building blocks of a single query applied to the
data, which is referred to as exploratory query.</p>
      <p>Given a set of objects O and the axis attributes  and  , an
exploratory query  over O is defined by the tuple ⟨S, F, D, G, N⟩,
where:
Selection clause S: defines a 2D range query (i.e., window query)
specified by two intervals  and  over the axis attributes 
and  , respectively. The Selection clause is denoted as S =
( ,  ) with its intervals to be referred as S. and S. . This
clause, selects the objects OS ⊆ O for which both of their axis
attributes have values within the respective intervals; i.e., their
axis attributes’ values are included in the 2D area (i.e., plane)
specified by the intervals S. and S. . The Selection clause is
mandatory in a query , while the remaining clauses are optional.
Filter clause F: defines a set of conjunction conditions which are
applied on the non-axis attributes. The Filter clause is defined as
F = {1, 2, ... }, where a condition  is a predicate involving
an atomic unary or binary operation over object attributes and
constants. The Filter clause is applied over the selected objects
OS, returning the objects O that satisfy the F conditions.
Details clause D: defines a set of non-axis attributes
D = {1, 2, ... }, for which the values of the objects  (that
satisfy the filter), will be returned by the query.</p>
      <p>Group-by clause G: defines a set of categorical attributes G =
{1, 2, ... } with  ∈ C, which are used in a group-by
operation. Given an set of objects  and a set of attributes C, the
group-by operation partitions  into a set of distinct groups,
denoted as G C , based on the different combinations of the values of
the C attribOutes in the  objects. Thus, here, the Group-by clause
G performs a group-by operation based on its attributes, over the
objects satisfying the filter O , resulting in the groups GOG .
Analysis clause L: defines two sets of algebraic aggregate
functions (e.g., count, sum, mean), where each of them is applied
over a set of numeric attributes, returning a single numeric value.
Particularly, the Analysis clause defines two sets of functions: (1)
 that are computed over the objects O returned by the query;
and (2) G that are computed over each group of objects resulted
by the group-by operations. Thus, the analysis clause is defined
as: L = ( , G).</p>
      <p>Note that the support of algebraic aggregate functions in our
model, enables the computation of a large number of complex
statistics (e.g., Pearson correlation, covariance).5</p>
      <p>
        Intuitively, the Selection and Filter clauses apply restrictions to
the entire space of objects, resulting in a set of qualifying objects
O , which is visually presented. For each object in O , the values
of the attributes included in the Details clause will be returned.
5 More than 90% and 75% of the statistics supported by SciPy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Wolfram [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
respectively, are defined as algebraic aggregate functions [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ].
      </p>
      <p>Then, the Group-by clause evaluates group-by operations over the
O objects. Finally, the set of aggregate functions of the Analysis
clause are computed over the objects of O , and the objects’
groups generated by the Group-by clause.</p>
      <p>The semantics of query execution involves the evaluation of the
different clauses of the query in the following order: (1) Selection;
(2) Filter; (3) Details; (4) Group-by; (5) Analysis.</p>
      <p>Query Result. The result R of an exploratory query  over O is
defined as R = (V,,D, V , VG), where:
(1) V,,D is a set of tuples corresponding to the objects O
returned by the query. For each object, its tuple contains: () the
values of the axis attributes  and  ; and () the values of the
attributes D defined in the Details clause. Formally,
V,,D = {⟨ : , , ,, ,1, ..., ⟩, ∀ ∈ O }, where
{1, ... } = D.
(2) V is a list of the numeric values resulted from computing
the aggregate functions  over the objects O returned by the
query. Formally, V = {ℓ1 (O ), ℓ2 (O ), ...ℓ (O )}, ∀ℓ ∈  .
(3) VG contains the results of the group-by clause. Particularly,
VG is a set of tuples, where each tuple corresponds to a group
G . The tuple of a group 
 of the group-by resulted groups GO
contains: () the values of the attributes G defined in the group-by
clause; and () the results of the aggregate functions G
(computed over  ). Formally,
G
VG = {⟨ : ,1, ..., , ℓ1 ( ), ...ℓ ( )⟩, ∀ ∈ GO }, where
{1, ... } = G and {ℓ1, ...ℓ } = G.
3</p>
    </sec>
    <sec id="sec-3">
      <title>CATEGORICAL-BASED TREE INDEX</title>
      <p>In this section, we present a tree structure that organizes objects
based on its categorical attribute values, named CET (Categorical
Exploration Tree).6 CET is designed as a lightweight,
memoryoriented, trie-like tree structure. In a nutshell, each tree level
corresponds to a different categorical attribute, and edges to
attribute values. Based on the tree hierarchy, each node is associated
with a set of objects, that are determined based on the node path.
These objects are stored in the leaf nodes.</p>
      <p>Considering the number of attribute value combinations which
are required for categorical indexing, a significant amount of
memory is required. Hence, the design of a memory-efficient
categorical structure is a major challenge, especially in our scenario,
where we consider limited available resources. To this end, in
CET, each object is stored (once) in the leaves, and allocates only
three numeric values (i.e., two axis attributes and a file offset).
Further, statistics are also only stored in the leaves, since the
hierarchical structure of CET allows the efficient computation of
statistics over different levels, by performing efficient, in-memory
aggregate operations. Note that in our implementation categorical
attribute values are mapped to distinct numeric value.</p>
      <p>A second challenge is to reduce the cost of I/O operations
which are crucial in such I/O-sensitive settings. Exploiting the
way CET stores the objects during the initialization phase, we are
able to access the raw file in a sequential manner . The sequential
ifle scan increases the number of I/Os over continuous disk blocks
and improves the utilization of the look-ahead disk cache (more
details in Sect. 5.1).
3.1</p>
    </sec>
    <sec id="sec-4">
      <title>The CET Tree</title>
      <p>In this section, we present the basic concepts of the CET tree.
Given a set of objects O and a list of categorical attributes C =
{0, 1, ... }, a CET tree ℎ organizes the objects ℎ.O based
on the values of the categorical attributes ℎ.C.</p>
      <p>The height of ℎ is |C |, so it has |C | + 1 levels (from 0 to |C |),
with the leaf nodes storing the objects.</p>
      <sec id="sec-4-1">
        <title>6CET is also referred to as tree.</title>
        <p>Apple Xiaomi
e f
Veriz
Huawei</p>
        <p>g o5
(a) CET Tree</p>
        <p>Samsg
h
c
c.S = AT&amp;T, *
c. = {o3, o4}</p>
        <p>Huawei
Apple Xiaomi
i j
o3</p>
        <p>The CET follows a “level-based” organization, where each
level corresponds to a different attribute. Based on the given order
of the attributes C, the nodes at level  have edges that correspond
to a different value of the attribute  ∈ C, i.e.,  ( ).</p>
        <p>Each node , is associated with a sequence of attribute values
. = ⟨0, 1...,  ⟩, that is defined by the path from the root to
node . The sequence contains |C | values, where the value 
corresponds to a value of the attribute at level . Specifically, for
a node  at the level , the first ℎ values in . are the attributes
values found in the path from the root to , while the rest |C | − 
values are assigned with the value any, denoted as ∗.</p>
        <p>Based on the sequence of values ., a node is associated with
a set of objects .O ∈ O, where its attribute values equal to the
sequence’s values. As a result, the tree defines an aggregation
structure, where in each node, the associated objects are the union
of the objects associated with its child nodes.</p>
        <p>Object Entries. Leaf nodes contain references to the data objects,
i.e., object entries. For each object  ∈ .O, an object entry  is
defined as ⟨, , ,,  ⟩, with , , , being the values of the axis
attributes and  the offset (a hex value) of  in the raw file. As
.E we denote the set of object entries stored in the leaf node . In
any case, an object entry has a constant size that is not affected by
the object’s characteristics (e.g, number of attributes), and is equal
to three numeric values: object’s  and  (e.g., two double),
and object’s offset from the beginning of file (e.g., a long). The
ifle offset  defines a “direct and precise” connection between an
object and the raw file.</p>
        <p>Synopsis Metadata. Apart from object entries, each leaf node
 is associated with a set of synopsis metadata .M, which are
(numeric) values calculated by algebraic aggregate functions over
one or more attributes of  .E objects such as sum, mean, sum
of squares of deltas, etc. Using the leaves’ metadata, we are able
to compute the metadata of any internal nodes , by aggregating
the metadata of the descendant nodes of , in a bottom-up fashion.
Example 1. [CET Tree] Figure 2a presents the CET index
constructed for the categorical attributes C = { ,  }.
The dotted lines denote parts of the tree that are not going to be
constructed for this dataset.</p>
        <p>Considering the level-based organization, the level 0
corresponds to the Provider attribute (the first attribute in C), and level
1 to Brand. The nodes in each level have as edges the values of
the level’s corresponding attribute; e.g., edges of node  are the
Provider values: Provider = {Ver, AT&amp;T}.</p>
        <p>Additionally, the node  has the associated sequence values
 . = ⟨AT&amp;T, ★⟩, where AT&amp;T corresponds to the path of , and
the value any is produced by the absence of the Brand attribute (in
the path). Further,  is associated with the objects  .O = {3, 4}
that “match” with the  . values, i.e., have as Provider the value
AT&amp;T and the value any for Brand.</p>
        <p>Regarding the leaf nodes, the leaf  stores the object entries  .E
and the metadata  .M for the objects  .O = {1, 2} that matches
its values  .S = ⟨Veriz, Samsg⟩ (Fig. 2b). Here, metadata stores
statistics regarding the Signal and the Width numeric attributes. ■
3.2</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Tree Operations</title>
      <p>In this section, we define the basic operations over the CET tree
and we study its space complexity.</p>
      <sec id="sec-5-1">
        <title>Insert Object &amp; Construct Operation. The insertion opera</title>
        <p>tion inserts an object  into a CET tree ℎ. It takes as input,
a tree ℎ, the object , and a list of categorical attributes C =
{0, 1, ... } based on which the tree organizes its objects.
The the tree construction operation implemented via an  for
each object defined in these operations.</p>
        <p>In brief, an object  is inserted to the corresponding leaf  . In
order to find  , we need to traverse the tree starting from the root;
during the traversal, nodes and edges which do not exist in the
tree, are constructed. The path that indicates  is defined by the
values of  in C attributes.</p>
        <p>Get Leaves/Objects Based on Filter Conditions. Here we present
the get leaves/objects operation under filter conditions . The
operation returns the leaf nodes L of a tree ℎ, that are matched based
on the categorical conditions defined in the Filter clause F of a
query. The operation, based on the conditions, constructs a path
expression  starting from the root to the leaf nodes. Then, for
each path produced by the path expression, it traverses the tree in a
top-down fashion. The union of the leaves L reached by the paths
is returned. The get objects based on filter condition operation is
implemented by returning the object’s entries of the leaves L.
Space Complexity. Considering the CET insertion process, a
node  is included in the tree, only if its sequence of values .
is associated with one of its objects. In other words, nodes that
represent combinations of attribute values that do no appear in
the data objects are not inserted in the tree structure. As a result,
the number of nodes depends not only on the attributes and its
domain, but also on the (distinct) values of the attributes in the
data objects.</p>
        <p>We can easily verify that the maximum number of nodes in a
CET tree occurs when all possible combinations of values for its
attributes appear in the objects it contains.</p>
        <p>Given the tree attributes ℎ.C = {0, 1, ... }, the
maximum number of nodes ℎ. in the worst case is: ℎ. = |dom(0 ) |+
|dom(0 ) | · |dom(1 ) | + ... + |dom(0 ) | · |dom(1 ) | · ... ·
−1 
|dom( ) | = Í Î |( )|. Also, each object entry is
con=0 =0
tained in only one leaf, we have that the space of all tree objects
is  (|ℎ.O |). Hence, the space complexity of a CET tree ℎ is:
−1 
O ( Í Î | ( ) | + |ℎ.O |).</p>
        <p>=0 =0
4</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>THE VETI INDEX</title>
      <p>In this section, we present the proposed indexing scheme, that
combines tile structures with trees.
20
10</p>
      <p>
        Long
Our work is built on top of the VALINOR index [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], referred
also as tile-structure. VALINOR is employed as the underlying
indexing technique for the support of exploration operations over
a 2D representation of raw data.
      </p>
      <p>VALINOR is a tile-based multilevel index, which is stored in
memory and organizes the data objects of a raw file into
hierarchies of non-overlapping rectangle tiles. Each tile is constructed
over a range of values for the  and  axis attributes.The
index is initialized with a number of tiles and progressively adjusts
itself to the user interactions, by splitting these tiles into more
ifne-grained ones, thus forming a hierarchy of tiles. The basic
concepts of the index are:
Tile &amp; Tile Hierarchies. A tile  is a part of the Euclidean space
defined by two left-closed, right-open intervals intervals  . and
 . , and T is the set of tiles defined in the tile-structure.</p>
      <p>In this work, we assume hierarchies of tiles (i.e., forest),
although a hierarchy with a single root tile can also be defined. In
each level of the hierarchy, there are no overlaps between the tiles
of the same level, i.e., disjoint tiles.</p>
      <p>A non-leaf tile  can have an arbitrary number of child tiles,
whereas leaf tiles are the tiles without children and can appear at
different levels in the hierarchy. Further, a non-leaf tile covers an
area that encloses the area represented by any of its children. That
is, given a non-leaf tile  defined by the intervals  . = [1, 2)
and  . = [1, 2); for each child tile  ′ of  , with  ′. = [1′, 2′ )
and  ′. = [1′, 2′ ), it holds that 1 ≤ 1′, 2 ≥ 2′, 1 ≤ 1′ and
2 ≥ ′ .</p>
      <p>2</p>
      <p>Each tile  encloses a set of objects  .O, where the axis values
, and , of each object  ∈  .O fall within the intervals of the
tile  ,  . and  . , respectively.
4.2</p>
    </sec>
    <sec id="sec-7">
      <title>VETI: Combining Tiles and Trees</title>
      <p>The VETI index (Visual Exploration Tile-Tree Index) combines
the VALINOR tile-based index with the CET tree, with each leaf
tile being associated with a CET tree.</p>
      <p>Given a raw data file F , two axis attributes  and  , and
a set C of categorical attributes of the objects stored in F ; the
VETI index I organizes the objects stored in F into hierarchies of
non-overlapping tiles based on its  ,  values.</p>
      <p>Let IT be the tiles of I. Each leaf tile  ∈ IT is associated with
a CET tree ℎ, denoted as  .ℎ. In analogy, a tree ℎ is associated
with a leaf tile  . A tile’s enclosed objects  .O are stored in the
leaf nodes of the associated tree  .ℎ, as objects entries. In case the
objects of a tile are not indexed based on the categorical attributes,
the tree ℎ corresponds to a node (root) that stores all the object
entries.</p>
      <p>The VETI index I is defined by a tuple ⟨IP, AP⟩, IP is the
initialization policy defining the methods that determine the
characteristics of VETI; AP is the adaptation policy defining the methods
for reconstructing the tiles and trees based on user interaction.</p>
      <p>As basic operations of the VETI we consider: initialization
(Sect. 4.3), query evaluation (Sect. 5.1), and adaptation (Sect. 5.2).
Example 2. [VETI Index] Figure 3 presents the VETI index that
results by the running example data (Fig. 1). VETI divides the
2D space into 4 × 3 equally sized disjoint tiles, and the tile  is
further divided into 2 × 2 subtitles of arbitrary sizes, forming a tile
hierarchy.</p>
      <p>The figure also presents the contents of the  tile, highlighted
with grey color in the index, that contains the the objects 3 and
4. For the tile  , the index stores its intervals  . and  .,
its child tiles  .C, and a pointer to its tree  .ℎ. Finally, the tree
that corresponds to the tile   and the content of the leaf node 
are shown in the figure (the objects entries and metadata details
are omitted in this figure). ■
4.3</p>
    </sec>
    <sec id="sec-8">
      <title>VETI Initialization Overview</title>
      <p>In our scenario, we do not consider any loading or
preprocessing. The index is constructed following the first user interaction.
During the initialization phase the following tasks are realized.
First, the characteristics of the index are determined; then, the file
is parsed and the objects populate the index; finally the query is
evaluated.</p>
      <p>Algorithm 1 outlines the initialization phase. The algorithm
takes as input, the raw file F , the axis and categorical attributes
 ,  and A C and the first exploratory query 0; and returns
the initialized index I and the results R0 of the 0.</p>
      <p>Initially, we have to determine the characteristics of the tile
structure. The determTiles method (line 1) defined by the
initialization policy IP, determines the tile structure (i.e., number, size
and intervals of the tiles). Next, based on the defined tiles’
characteristics, a flat tile structure IT is constructed, i.e., the intervals of
each tile are stored.</p>
      <p>In the next part (loop in line 3), the algorithm scans the file F .
For each object  , the algorithm reads the attribute values of axis
attributes , , , , the values for its categorical attributes, and
the attribute values which are required to evaluate the Analysis
clause of 0 (line 4). Next, the tile  that encloses  is found
(line 6), and the insertToTree method (Sect.3.2), inserts  into the
CET tree  .ℎ (line 7) of  . During the insertion, the object entry
is constructed, the tree metadata are updated, and new parts (i.e.,
nodes, edges) of the tree may be constructed.</p>
      <sec id="sec-8-1">
        <title>Determine Tile Structure. The determTiles method (line 1) de</title>
        <p>
          ifned by the tile initialization policy IP, determines the tile
structure (i.e., number, size and intervals of the tiles). These
characteristics can be defined via numerous approaches (e.g., given by
the user or determined by the visualization setting, resolution).
Here, we use a locality-based probabilistic initialization approach
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], that is based on the first user query. The main idea of this
method is based on the locality-based characteristics of
exploration scenarios, in which queries are highly likely to reside in
areas near the previous ones [
          <xref ref-type="bibr" rid="ref24 ref42 ref8">8, 24, 42</xref>
          ]. Hence, tiles near the
ifrst user query have higher probability to overlap with a future
query, compared to tiles farther away. In order to improve the
query evaluation performance, the index structure has to reduce
the number of I/O operations during query processing. Recall that,
in VETI, aggregate metadata are stored in the tree of each tile.
This way, queries can exploit the tree metadata of the tiles that
overlap, and avoid I/O operations. The use of metadata depends
on whether an overlapping tile is fully or partially-contained in
the query’s window (in 2D space). Apparently, the metadata of a
Algorithm 1. Initialization (F,  ,  , C, 0)
fully-contained tile could be used, without the need to access the
ifle, while in partially-contained cases, the file has to be accessed
in order to compute the metadata referred to the query’s contained
part. More details about query evaluation over VETI are presented
in Section 5.1.
        </p>
        <p>
          Our initialization method defines a tile structure that is more
ifne-grained (i.e., having a large number of smaller tiles) in the
area around the initial query, whereas the size of tiles becomes
larger as their distance from the initial query gets bigger.
Increasing the number (i.e., decreasing the size) of tiles near the first
query, increases the possibility that subsequent user queries in this
neighborhood overlap with fully-contained tiles, which in turn
reduces the I/O cost (for more details see [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]).
        </p>
        <p>Remark. There are cases where the user’s exploration is performed
in more than one sessions. In such scenarios, the user can perform
the exploration, store and reload intermediate states of the index
in subsequent user sessions. Thus, the user avoids the cost of
reinitialization, as well as the fine tuning (e.g., adaptation) of the
index to the previously performed user’s operations.
5</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>QUERY PROCESSING &amp; INDEX</title>
    </sec>
    <sec id="sec-10">
      <title>ADAPTATION</title>
      <p>This section describes the evaluation of the queries and the index
adaptation.
5.1</p>
    </sec>
    <sec id="sec-11">
      <title>Query Processing</title>
      <p>An overview of the query evaluation is presented in Algorithm 2.
The algorithm takes as input, the initialized index I, an exploratory
query  and the raw file F .</p>
      <sec id="sec-11-1">
        <title>Find and Adapt Query Related Tiles &amp; Trees. Once the index</title>
        <p>has been initialized, to evaluate a query , we need to find the tiles
TS that overlap with the 2D area defined in the query’s Selection
clause S (line 2). Specifically, function getOverlappingLeafTiles
ifrst determines the overlapping tiles at the highest level, and then
traverses the tile hierarchy to find the set of overlapping leaf tiles
TS.</p>
        <p>Next, based on the adaptation policy AP, the adapt procedure
(Proc. 1), performs the tile splitting and reorganizes the objects
in the trees of the tiles T created by the splitting process (more
details in Sect. 5.2).</p>
        <p>Then, considering any conditions over categorical attributes that
are defined in the Filter clause, getLeavesBasedOnFilter (Sect. 3.2)
retrieves the leaf nodes L of the affected trees (line 5). In other
words, L are the leaves of the trees that belong to tiles overlapping
the query, after the categorical conditions have been applied in
these trees.</p>
      </sec>
      <sec id="sec-11-2">
        <title>Determine the Objects that Require File Access. Procedure</title>
        <p>getLeavesRequiringFileAccess (O , ) (line 6) determines the
objects for which we have to access the raw file in order to answer
the query.</p>
        <p>Algorithm 2. Query Evaluation (I, , F )</p>
        <sec id="sec-11-2-1">
          <title>Input: I: index (initialized);  ⟨S, F, D, G, L⟩: query; F: raw data file</title>
          <p>Variables: TS: leaf tiles that overlap with the Selection clause (i.e., 2D area);</p>
          <p>T: tiles resulted by adaptation; L: tree leaf nodes selected by the Query;
Parameters: AP: adaptation policy;</p>
          <p>Output: R: result of query 
1 L ← ∅
2 TS ← getOverlappingLeafTiles (IT, S)
3 foreach  ∈ TS do
4 T ← AP.adaptTileAndTree (, )
5 ∀ ∈ T : L ← L Ð getLeavesBasedOnFilter ( .ℎ, F)
//see Sect. 5.2
6 W ( ⟨, V ⟩) ← getLeavesRequiringFileAccess ( L, ) //Set of tuples, where V are
the (objects’) attributes of the leaf  where their values have to be retrieved from the file
7 if W ≠ ∅ then //values are missing — read from file
8 read from file the values of attributes V for each leaf  ∈ W
9 updateLeafMetadata (  ) ∀ ∈ W
10 R ← evaluate  using the objects and the metadata of leaves L
11 return R</p>
          <p>To this end, we need to consider the spatial relation between
the 2D area specified in the Selection clause and the tiles it
overlaps. Specifically, a tile that overlaps a 2D window query can be
partially-contained or fully-contained in it. For a fully-contained
tile, we need not examine its objects in order to find the ones
that are included in the window. Also, metadata that refers to the
objects of a tile can be utilized to avoid accessing the raw file.</p>
          <p>For each leaf node in L, procedure getLeavesRequiringFileAccess
ifrst checks if the tile it belongs to is partially or fully-contained
in the Selection clause of the query. For a node that belongs to
a partially-contained tile we need to retrieve from the file the
attributes included in the Analysis clause for every one of its
objects that is contained in the window query. In contrast, if a node
belongs to a fully-contained tile, we know that all of its objects
are contained in the Selection clause of the query. File access is
required if a Details clause is requested in the query, or no
metadata exist for that node to be used to evaluate the Analysis Clause.
Also, in case the tile has not been initialized with a CET tree and
the query includes a Filter or Group-by clause on some of the
categorical attributes, we need to access the raw file to read these
attributes for all the objects of that tile.</p>
          <p>Next, we access the file for the objects contained in leaves L
and retrieve in memory the missing attributes C (line 8). Note
that if a leaf node belongs to a partially-contained tile, we only
access the file for its objects that are included in the window query.
To improve the performance of reading the missing attributes
from file, we exploit the way the object entries are stored in the
leaves in order to access the file in a sequential manner. During
the initialization of the index, we append the object entries into
the leaf nodes of the CET trees as the file is parsed. As a result,
object entries in every leaf node are stored sorted based on their
ifle offset. When accessing the file, we read the objects from the
leaves following a -way merge based on objects file offset. This
sequential file reading results in faster I/O operation by utilizing
the look-ahead disk cache.</p>
          <p>Then, based on the values read from the raw file, function
updateLeafMetadata computes and updates the metadata of the
corresponding leaf nodes, considering the aggregate functions that
are used in the Analysis clauses of the query.</p>
          <p>Evaluate Query. Finally, we evaluate query  using the objects
OL and metadata of the leaf nodes L (line 10). The evaluation
of the Filter clause of the query starts implicitly when
determining the set of leaf nodes using function getLeavesBasedOnFilter
based on the filter conditions over the categorical attributes. Here,
we use the attribute values V retrieved from the file to check</p>
        </sec>
        <sec id="sec-11-2-2">
          <title>Procedure 1: adaptTileAndTree(, )</title>
          <p>every object in OL against the filter conditions that do not
involve the categorical attributes. Finally, for the evaluation of the
Group-by and Analysis clauses, we utilize existing metadata for
nodes belonging to fully-contained tiles with trees indexing every
categorical attribute included in the Filter and Group-by clauses.
For all other cases, we evaluate the functions requested in the
Analysis clause using the values retrieved from the file.
5.2</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Incremental Adaptation</title>
      <p>VETI employs an incremental index adaptation technique that
attempts to adapt the index structure to the query workflow of the
user exploration. The adaptation in VETI may incrementally split
a tile that overlaps the Selection clause, into smaller subtiles. This
tile splitting increases the likelihood that a future query will fully
overlap a tile in the area that the user exploration focuses, and will
improve query performance by reducing accesses to the file.</p>
      <p>Procedure adapt (Proc. 1) is responsible for the incremental
adaptation. It takes as input a tile  and a query  and returns a set
of subtiles T if the tile  needs to be split.</p>
      <p>
        To split, or not to split? During query processing, we examine
each tile that overlaps the window query if it needs to be split.
To determine if a tile requires (further) splitting, splitRequired
function (line 1) estimates the number of I/O operations needed to
evaluate query  in a tile  and compares it to a numeric threshold
for the maximum number of I/Os. If the expected I/O cost for a
tile exceeds that threshold, a split is performed. To estimate the
number of necessary operations, we take into consideration not
just the Selection clause but also the Filter clause. Specifically,
even if the objects belonging to a tile may exceed the threshold
set, query  may filter on a categorical attribute and using the
tile’s CET tree the number of objects we need to examine may be
much less than the threshold. Details about the splitting model are
presented in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Tile Splitting. The split is performed in function splitTile (line 2)
which returns a new set of tiles T . Each one of the child tiles that
result contains a tree with the same set of categorical attributes
as their parent tile. The objects contained in the leaf nodes of
the parent tile’s tree are reorganized in the leaf nodes of the new
trees according to their values for the axis attributes, as well as
the categorical attributes. In our implementation for VETI, we
employ a quad-tree like splitting approach in which a tile is split
into 4 equal subtiles. However, more sophisticated methods can
be used to split a tile, e.g. query based splitting methods [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
6
6.1
      </p>
    </sec>
    <sec id="sec-13">
      <title>EXPERIMENTAL ANALYSIS</title>
    </sec>
    <sec id="sec-14">
      <title>Experimental Setup</title>
      <p>Datasets. We have used a real dataset, the NYC Yellow Taxi Trip
Records (TAXI), which is a CSV file, containing information
regarding yellow taxi rides in NYC7. From this dataset, we selected
a subset that includes taxi trip records in 2014 (165M objects,
26 GB) with each record object referring to a specific taxi ride
described by 18 attributes (e.g., pick-up and drop-off dates and
7Available at: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
locations, trip distances, fares, and tip amount). From the TAXI
dataset, we selected the pickup location longitude and latitude as
the axis attributes of the exploration, while the tip amount was
selected as the attribute in the Analysis clause, for which
aggregates were evaluated. The TAXI dataset includes 5 categorical
attributes (e.g. payment type, passenger count, and rate type) with
cardinality varying between 2 and 9. Each query is defined over
an area of 500m × 500m size (i.e., window size), simulating a
map-based exploration at the neighborhood zoom level, with the
ifrst query 0 posed in central Manhattan. Each query contains a
Group-by clause on the passenger count attribute, while the Filter
clause includes 1 to 2 filter conditions randomly specified over
the remaining categorical attributes.</p>
      <p>Regarding the synthetic datasets (SYNTH10/50), we have
generated two files of 100M data objects, having 10 and 50 attributes
(11 and 51 GB, respectively). The synthetic datasets contain
numeric attributes in the range (0, 1000), as well as 6 categorical
attributes. The default cardinality for the categorical attributes is
10, while we also generated three variations of the SYNTH10
dataset where the cardinality of every categorical attribute was
20 and 50 respectively. All attributes in the synthetic datasets
follow a uniform distribution. In our experiments, we selected
two of the numeric attributes as the axis attributes of a 2D
exploration scenario, and another one was selected as the attribute
for which aggregates were evaluated. For the query sequences we
generated for the synthetic dataset, we used a window size with
approximately 100K objects.</p>
      <p>
        Evaluation Scenarios. We study the following visual exploration
scenario: (1) First, the user selects the two axis attributes and
requests to explore a region of the data from the raw file,
specifying also an attribute for which the aggregate functions (avg, sum,
min, max and standard deviation) will be evaluated, as well as
selecting a set of categorical attributes that are of interest for the
exploration. For this action, referred to as “From-Raw
Data-to1stResults”, we measure the execution time for creating the index
and answering the first query, the results of which are evaluated
directly on the raw file, during index initialization. (2) Next, the
user continues exploring neighboring areas, while also filtering
the data and analyzing it based on values for a Group-by attribute.
For this, we generated sequences of 100 overlapping queries, with
each window query shifted in relation to its previous one by 1-20%
towards a random direction. This scenario attempts to formulate
a common user?s behavior in 2D visual exploration, where the
user explores nearby regions using pan operations [
        <xref ref-type="bibr" rid="ref24 ref42 ref8">8, 24, 42</xref>
        ]. For
example, assume the common “region-of-interest” or “following-a
path” scenarios in map visual exploration.
      </p>
      <p>Further, each query contains a Group-by clause over a single
categorical attribute for the Group-by clause, while we randomly
alternated the Filter clause to include 1 or 2 equality conditions
over the remaining categorical attributes of the dataset. To reflect
our exploration-oriented assumption that attributes included in the
initial query 0 are more likely to be included in the next queries,
we generated the query sequences so that these attributes appeared
more frequently in the Filter clause of the queries.</p>
      <p>
        VETI Parameters. The VETI’s tile structure was initialized with
100 × 100 equal-width tiles, while an extra 20% of the number
|T0 | of initial tiles was also distributed around the first query 0
using the locality-based initialization method [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The numeric
threshold for the adaptation of VETI was set to 200.
      </p>
      <p>Index Initialization Budget. In our experiments, we assume a
categorical-based index initialization budget, which is an upper
bound of the memory required for constructing the CET trees of
the tiles. This budget includes only the memory allocated by the
tree structures, and does not include the memory required to store
the object entries. The tree memory cost is mainly determined by
the number of tree nodes, which is in turn defined by the attributes
and their domains. In our estimation for the tree cost, we assume</p>
      <p>PRAW
28000 sec
)
c
se4000
(
e
m
i
T
n
o
i
tz
a
ilii
a
t
In 0</p>
      <p>SYNTH10</p>
      <p>SYNTH50</p>
      <p>TAXI
that all distinct values of an attribute appear in the objects of a
tile and, as a result, in the tree to be constructed. Based on the
initialization budget and the estimated cost of the CET tree, we
sort tiles based on their distance from 0 and assign trees to them
until the initialization budget is exhausted. For the remaining tiles,
we do not construct tree structures during initialization. The CET
tree of such a tile will be constructed on demand, should a future
query overlap it. In our experiments, the index initialization budget
was set to 2GB.</p>
      <p>
        Competitors. We compare our method with: (1) VALINOR [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
which contains only the tile-based indexing structure without the
CET tree structure for supporting efcfiient evaluation of queries
with categorical attributes; (2) A popular DBMS (MySQL 8.0.22),
in which the entire dataset is loaded and indexed in advance; three
indexing settings are considered: () no indexing (SQL-0I); ()
one composite B-tree on the two axis attributes (SQL-1I); and
() two single B-trees, one for each of the two axis attributes
(SQL-2I). MySQL also supports SQL querying over external files
(see CSV Storage Engine in Sect. 7); however, due to low
performance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we do not consider it as a competitor in our evaluation.
(3) PostgresRaw (PRAW)8, build on top of Postgres 9.0.0 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which is a generic platform for in-situ querying over raw data
(Sect. 7).
      </p>
      <p>Metrics. In our experiments, we measure the: (1) execution time
for each query; (2) accumulative execution time for the entire
exploration scenario; and (3) the number of I/O operations.
Implementation. We have implemented VETI on JVM 1.8 and
the experiments were performed on an 3.60GHz Intel Core i7 with
64GB of RAM. We applied memory constraints (12GB max Java
heap size); however, PRAW required more than 32GB and 50GB
for the synthetic and the TAXI dataset, respectively.
6.2</p>
    </sec>
    <sec id="sec-15">
      <title>Results</title>
      <p>Initialization Time. In this experiment, we measure the time
required to answer the first query 0. This time also includes
the time required for initializing each one of the methods we
examine. Besides executing 0, MySQL needs to load and index
the data, while PRAW needs to parse the raw file and construct
the positional map. When evaluating 0, VALINOR generates the
tile index structure and populates it with the object entries, while
VETI needs to also construct the categorical-based tree indexes.</p>
      <p>Figure 4 presents the results for every dataset used. MySQL
exhibits the worst performance for evaluating the first query. MySQL
needs to parse all attributes of the raw file and store all data on
disk. Also, for the SQL-1I and SQL-2I cases, the corresponding
indexes must be built, which explains the increased initialization
time in relation to SQL-0I where no index is generated.</p>
      <p>Both VALINOR and VETI exhibit better initialization
performance compared to PRAW for the SYNTH50 and TAXI datasets,
while for the SYNTH10 dataset VETI requires a slightly higher
initialization time.</p>
      <p>VETI is slightly slower during the initialization compared to
VALINOR. This is expected as VETI needs to also determine the</p>
      <sec id="sec-15-1">
        <title>8https://github.com/HBPMedical/PostgresRAW</title>
        <p>160</p>
        <p>10 20 50</p>
        <p>Categorical Attribute Cardinality
tile-tree assignments, parse the categorical attributes for all objects
and create the corresponding tree structures. Despite this slight
difference in initialization time, VETI performs much faster than
VALINOR when answering queries that include the categorical
attributes.</p>
        <p>Modify the Cardinality of Categorical Attributes. For this
experiment, we study the effect of the cardinality of the categorical
attributes in the performance of VETI. We ran this experiment
against 3 versions of the SYNTH10 dataset where the cardinality
of the categorical attributes for each one was set to 10, 20 and 50
respectively. The cumulative time needed to execute the complete
query sequence of the exploration scenario is shown in Figure 5.
It is the time needed to execute the complete workload by VETI,
including 0 which is depicted separately from all subsequent
queries as it represents the initialization time of the index. Observe
that the initialization time increases slightly with increasing
cardinality. This increase can be attributed to the higher construction
cost of the wider CET trees when the cardinality of the categorical
attributes increases. On the contrary, the total execution time of
queries 1 ∼ 99 decreases with higher cardinality (2.8, 1.4, 0.8
for cardinality 10, 20 and 50 respectively). This decrease in
execution time is directly related to the number of I/O operations we
need to perform during the workload. The objects of the synthetic
dataset have values uniformly distributed over each attribute’s
domain. As a result, in the case of a higher cardinality for the
categorical attributes, the same number of filter conditions are
satisfied by a smaller number of objects, requiring fewer overall
I/O operations.</p>
        <sec id="sec-15-1-1">
          <title>Performance during the Exploration Scenario. In the next ex</title>
          <p>periment, we compare the query evaluation performance of VETI
against the competitors. The execution time for queries 1 ∼ 99
is shown in Figure 6. Note that 0 is omitted in this figure, as it
includes the initialization time which was examined separately in
Figure 4. In the results, we omit the plots for MySQL as it exhibits
a much higher execution time.</p>
          <p>Compared to the other methods, VETI exhibits significantly
lower execution time in almost all cases. Regarding PRAW, we
observe that it performs much worse than both VALINOR and VETI
for all datasets examined. The positional map used in PRAW,
attempts to reduce the parsing and tokenizing costs of future queries,
by maintaining the position of specific attributes for every object
in the raw file. However, PRAW still needs to examine all objects
in the dataset in order to select the ones contained in a 2D window
query. Also, in contrast to VETI, PRAW does not keep any
metadata in order to efficiently compute the aggregate queries. Observe
that some of the early queries in the sequence exhibit execution
time significantly higher than the rest, and comparable to the time
required to answer 0. Since the queries we issue have 1 or 2
randomly specified filter conditions on the categorical attributes,
these queries need to tokenize and parse attributes that were not
included in 0, and populate the positional map with these. This
explains their higher execution time. Once the positional map has
been populated with all the attributes including in the queries,
PRAW exhibits a relatively constant execution time.</p>
          <p>Compared to VALINOR, as we can observe, VETI exhibits
a much faster execution time for every query in the workload.</p>
          <p>PRAW</p>
          <p>PRAW
Q1</p>
          <p>Q15 Q29 Q43 Q57 Q71 Q85 Q99</p>
          <p>Query Sequence
Even though both indexes attempt to adapt to the workload and
maintain metadata to speed up query execution time by reducing
I/Os, VALINOR does not include any indexing capabilities for
categorical attributes. This results in not being able to utilize
tilebased aggregate metadata to evaluate queries that include a
Groupby clause or a Filter clause that refers to a categorical attribute. In
contrast, VETI organizes object entries and metadata in tile-tree
structures based also on the values for the categorical attributes.
For this reason, when a query includes a categorical attribute,
VETI traverses the tree structures of every overlapping tile and
iflters the objects it needs to retrieve from the raw file. Also, if
any leaf tree nodes of a fully-contained tile contain aggregate
metadata, it can be utilized to avoid accessing the raw file.</p>
          <p>The execution time for VETI as well as for VALINOR is mainly
determined by the number of rows that need to be retrieved from
the raw file. This can be seen in Figure 7, where the I/O plots
exhibit approximately the same behavior as the corresponding
execution time plots in Figure 6. Compared to VALINOR, VETI
requires fewer I/O operations for every query in the sequence.
VALINOR has to access the raw file for every object contained in
the 2D window query since it has to retrieve their values for the
categorical attributes specified in the Filter and Group-by clause.
7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>RELATED WORK</title>
      <p>
        In-situ Processing. Data loading and indexing usually take a large
part of the overall time-to-analysis for both traditional RDBMs and
Big Data systems [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In-situ query processing aims at avoiding
data loading in a DBMS by accessing and operating directly over
raw data files. NoDB [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a philosophy for constructing a
nodbms querying architecture, and PostgresRAW is one of the first
efforts for in-situ query processing. PostgresRAW incrementally
builds on-the-fly auxiliary indexing structures called “positional
maps” which store the file positions of data attributes, as well as it
stores previously accessed data into cache. As opposed to VETI,
the positional map in PostgresRAW, can only be exploited to
reduce parsing and tokenization costs during query evaluation and
can not be used to reduce the number of objects examined in
twodimensional range queries with filter conditions on categorical
attributes.
      </p>
      <p>
        DiNoDB [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] is a distributed version of PostgresRAW. In the
same direction, RAW [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] extends the positional maps in order to
both index and query files in formats other than CSV. In the same
context, Proteus [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] supports various data models and formats.
Recently, Slalom [
        <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
        ] exploits the positional maps and
integrates partitioning techniques that take into account user access
patterns.
      </p>
      <p>
        RawVis [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] defines a tile-based index in the context of
in-situ visual exploration, supporting 2D visual operations over
numeric attributes. Compared to VETI, RawVis does not support
operations and indexing over categorical attributes. As a result,
it cannot exploit well-know visualization techniques, such as bar
charts, heatmaps, pies and parallel coordinates. Particularly, VETI
extends a tile-based structure similar to RawVis, with trees that
enrich tiles with information about categorical attributes.
      </p>
      <p>
        Additionally, several well-known DBMS support SQL querying
over CSV files. Particularly, MySQL provides the CSV Storage
Engine [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Oracle offers the External Tables [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and PostgreSQL
has the Foreign Data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, these tools do not focus on user
interaction, parsing the entire file for every query, and resulting
in significantly low query performance for interactive scenarios
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The aforementioned works study the generic in-situ
querying problem without focusing on the specific needs for raw data
visualization and exploration. In addition, due to low query
performance cannot be used in exploration scenarios. Instead, our
work considers the in-situ processing of a specific query class,
that enables user operations and visual analytics. The goal of our
solution is to optimize these operations, so that visual interaction
with raw data is performed efficiently on very large input files
using commodity hardware.
      </p>
      <p>
        Visual-Oriented Indexes. In the context of visual exploration,
several indexes have been introduced. VisTrees [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and HETree
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are tree-based main-memory indexes that address visual
exploration use cases; i.e., they offer exploration-oriented features
such as incremental index construction and adaptation. Compared
to our work, both indexes focus on one-dimensional visualization
techniques (e.g., histograms), do not support categorical attributes
and group-by analytics, and do not consider disk storage; i.e., data
stored in-memory.
      </p>
      <p>
        Nanocubes [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], Hashedcubes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], SmartCube [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], Gaussian
Cubes [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], and TopKubes [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] are main-memory data structures
defined over spatial, categorical and temporal data. The
aforementioned works are based on main-memory variations of a data
cube in order to reduce the time needed to generate
visualizations. Nanocubes [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] attempts to reduce the memory of the data
cube by sharing nodes in a single tree structure. Hashedcubes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
follows a different approach where, instead of materializing all
possible aggregations, it uses a partial ordering of the dimensions
and the notion of pivot arrays to calculate on-the-fly the
aggregations missing. Smartcube [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] is a variation of Nanocubes, where
instead of pre-computing all cuboids from the start, it chooses
some important ones based on the user queries, in order to reduce
memory usage. Also, it may adaptively change stored cuboids
when querying patterns change. In comparison with our work,
the indexes in the aforementioned works are generated during a
preprocessing phase, and thus do not address the need of reducing
the initialization time, i.e., they cannot be used in in-situ scenarios.
Moreover, a major difference compared to our approach, is that
these works assume that all the aggregations are materialized and
stored in memory, which in the case of very fine-grained spatial
indexing or many categorical dimensions, can require
prohibitive amounts of memory. In contrast, our approach adapts the
granularity of the spatial index based on user interaction.
      </p>
      <p>
        Further, graphVizdb [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] is a graph-based visualization tool,
which employs a 2D spatial index (e.g., R-tree) and maps user
interactions into window 2D queries. To support the operation of
the tool, a partition-based graph drawing approach is proposed.
Compared to our work, graphVizdb requires a loading phase where
data is first stored and indexed in a relational database system. In
addition, it targets only graph-based visualization.
(b) SYNTH50
Q1
      </p>
      <p>Q15</p>
      <p>
        In a different context, tile-based structures are used in visual
exploration scenarios. Semantic Windows [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] considers the
problem of finding rectangular regions (i.e., tiles) with specific
aggregate properties in exploration scenarios. ForeCache [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] considers
a client-server architecture in which the user visually explores
data from a DBMS. The approach proposes a middle layer which
prefetches tiles of data based on user interaction. Our work
considers different problems compared to the aforementioned
approaches.
      </p>
      <p>
        We should note that there are a large number of spatial indexes
such as the R-tree, kd-tree, quadtree [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] which could be used in
the context of data exploration. However, most of these structures
consider several criteria (e.g., tree balance, fill guarantees) in
order to improve query processing, which results in a significant
amount of time required for their construction [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. As a result,
they are not suitable for the in-situ setting, which requires small
initialization overhead.
      </p>
      <p>
        Adaptive Indexing. Similarly to VETI, the basic idea of
approaches like database cracking and adaptive indexing [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref21 ref22 ref23 ref32 ref37 ref38 ref7">7, 17–
19, 21–23, 32, 37, 38</xref>
        ], is to incrementally build and adapt indexes
during query processing, following the characteristics of the
workload. However, in these works the data has to be previously loaded
in the system, i.e., a preprocessing phase is required. As a
result, these approaches are not suitable for in-situ query scenarios,
where the cost of the preprocessing phase has to be minimized.
In addition, the existing cracking and adaptive indexing methods
have been developed in the context of column-stores [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref21 ref22 ref23 ref37 ref7">7, 17–19, 21–
23, 37</xref>
        ], or MapReduce systems [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. On the other hand, VETI
has been developed to handle raw data stored in text files with
commodity hardware. Finally, [
        <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
        ] study the problem of
incremental indexing for 3D spatial data.
8
      </p>
    </sec>
    <sec id="sec-17">
      <title>CONCLUSIONS</title>
      <p>In this paper, we have presented an indexing scheme and adaptive
processing methods for in-situ visual exploration and analysis that
allow the user to combine visual exploration of data from a raw
ifle on a 2D canvas with sophisticated analysis over its categorical
attributes. This scheme combines tile structures with trees and
is progressively adapted based on user interaction. Finally, the
presented method was evaluated experimentally.</p>
      <p>Acknowledgment. This work is funded by the project VisualFacts (#1614
- 1st Call of the Hellenic Foundation for Research and Innovation Research</p>
      <sec id="sec-17-1">
        <title>Projects for the support of post-doctoral researchers).</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] MySQL: The CSV Storage Engine</article-title>
          . https://dev.mysql.com/doc/refman/8.0/en/ csv-storage-engine.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Oracle</surname>
          </string-name>
          :
          <article-title>External Table Enhancements in Oracle Database 12c Release 1</article-title>
          . https: //oracle-base.com/articles/12c/external
          <article-title>-table-enhancements-12cr1.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>PostgreSQL: Foreign</given-names>
            <surname>Data</surname>
          </string-name>
          . https://www.postgresql.org/docs/current/ ddl-foreign
          <article-title>-data</article-title>
          .
          <source>html.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>[4] SciPy: Open Source Scientific Tools for Python</article-title>
          . http://www.scipy.org.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wolfram</surname>
          </string-name>
          : Descriptive Statistics. https://reference.wolfram.com/language/ tutorial/DescriptiveStatistics.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>I. Alagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Borovica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          . Nodb:
          <article-title>Efficient Query Execution on Raw Data Files</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Alexiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia</article-title>
          . PVLDB,
          <volume>6</volume>
          (
          <issue>14</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Battle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          .
          <article-title>Dynamic Prefetching of Data Tiles for Interactive Visualization</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bikakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liagouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krommyda</surname>
          </string-name>
          , G. Papastefanatos, and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Towards Scalable Visual Exploration of Very Large Rdf Graphs</article-title>
          .
          <source>In ESWC</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bikakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liagouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krommyda</surname>
          </string-name>
          , G. Papastefanatos, and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Graphvizdb: A Scalable Platform for Interactive Large Graph Visualization</article-title>
          .
          <source>In ICDE</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bikakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maroulis</surname>
          </string-name>
          , G. Papastefanatos, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>RawVis: Visual Exploration over Raw Data</article-title>
          .
          <source>In ADBIS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bikakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maroulis</surname>
          </string-name>
          , G. Papastefanatos, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>In-situ Visual Exploration over Big Raw Data</article-title>
          .
          <source>Information Systems</source>
          ,
          <volume>95</volume>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bikakis</surname>
          </string-name>
          , G. Papastefanatos,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skourla</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>A Hierarchical Aggregation Framework for Efficient Multilevel Visual Exploration and Analysis</article-title>
          .
          <source>SWJ</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>C. A. de Lara Pahins</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Stephens</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Scheidegger</surname>
            , and
            <given-names>J. L. D.</given-names>
          </string-name>
          <string-name>
            <surname>Comba</surname>
            . Hashedcubes: Simple,
            <given-names>Low</given-names>
          </string-name>
          <string-name>
            <surname>Memory</surname>
          </string-name>
          ,
          <article-title>Real-time Visual Exploration of Big Data</article-title>
          .
          <source>IEEE TVCG</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>El-Hindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          . Vistrees:
          <article-title>Fast Indexes for Interactive Data Exploration</article-title>
          .
          <source>In HILD</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gaede</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Günther</surname>
          </string-name>
          .
          <source>Multidimensional Access Methods. ACM Comput. Surv.</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          .
          <article-title>Self-selecting, self-tuning, incrementally optimized indexes</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Halim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Karras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. H. C.</given-names>
            <surname>Yap</surname>
          </string-name>
          . Stochastic Database Cracking:
          <article-title>Towards Robust Adaptive Indexing in Main-Memory Column-Stores</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>6</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Holanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Raasveldt</surname>
          </string-name>
          . Progressive Indexes:
          <article-title>Indexing for Interactive Data Analysis</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>12</volume>
          (
          <issue>13</issue>
          ),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Alagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki. Here Are My Data Files. Here Are My Queries. Where Are My Results? In</surname>
          </string-name>
          <string-name>
            <surname>CIDR</surname>
          </string-name>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Kersten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. Manegold. Database</given-names>
            <surname>Cracking. In</surname>
          </string-name>
          <string-name>
            <surname>CIDR</surname>
          </string-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Kersten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          .
          <article-title>Self-organizing tuple reconstruction in column-stores</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Kuno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe. Merging</surname>
          </string-name>
          <article-title>What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>9</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Çetintemel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          .
          <article-title>Interactive Data Exploration Using Semantic Windows</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karpathiotakis</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Alagiannis, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          .
          <article-title>Fast Queries Over Heterogeneous Data Through Engine Customization</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>9</volume>
          (
          <issue>12</issue>
          ),
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karpathiotakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branco</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Alagiannis, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          .
          <source>Adaptive Query Processing on Raw Data. PVLDB</source>
          ,
          <volume>7</volume>
          (
          <issue>12</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Lins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Klosowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Scheidegger</surname>
          </string-name>
          .
          <article-title>Nanocubes for Real-Time Exploration of Spatiotemporal Datasets</article-title>
          .
          <source>IEEE TVCG</source>
          ,
          <volume>19</volume>
          :
          <fpage>2456</fpage>
          -
          <lpage>2465</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          . Smartcube:
          <article-title>An adaptive data management architecture for the real-time visualization of spatiotemporal datasets</article-title>
          .
          <source>IEEE TVCG</source>
          ,
          <volume>26</volume>
          (
          <issue>1</issue>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maroulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bikakis</surname>
          </string-name>
          , G. Papastefanatos, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>RawVis: A System for Efficient In-situ Visual Analytics</article-title>
          . SIGMOD,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>F.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Klosowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>TopKube: A Rank-Aware Data Cube for Real-Time Exploration of Spatiotemporal Data</article-title>
          .
          <source>IEEE TVCG</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.</given-names>
            <surname>Morton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grossman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Mackinlay</surname>
          </string-name>
          .
          <article-title>Support the Data Enthusiast: Challenges for Next-generation Data-analysis Systems</article-title>
          . PVLDB,
          <volume>7</volume>
          (
          <issue>6</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>V.</given-names>
            <surname>Nathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          .
          <article-title>Learning multi-dimensional indexes</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Olma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karpathiotakis</surname>
          </string-name>
          , I. Alagiannis,
          <string-name>
            <given-names>M.</given-names>
            <surname>Athanassoulis</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          . Slalom:
          <article-title>Coasting through Raw Data Via Adaptive Partitioning and Indexing</article-title>
          . PVLDB,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Olma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karpathiotakis</surname>
          </string-name>
          , I. Alagiannis,
          <string-name>
            <given-names>M.</given-names>
            <surname>Athanassoulis</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          .
          <article-title>Adaptive partitioning and indexing for in situ query processing</article-title>
          .
          <source>VLDBJ</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavlovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sidlauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heinis</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          . QUASII:
          <article-title>query-aware spatial incremental index</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavlovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Tzirita</given-names>
            <surname>Zacharatou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sidlauskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heinis</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          .
          <article-title>Space odyssey: efficient exploration of scientific data</article-title>
          .
          <source>In ExploreDB</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>E.</given-names>
            <surname>Petraki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Manegold</surname>
          </string-name>
          .
          <article-title>Holistic Indexing in Main-memory Column-stores</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiané-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schuh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dittrich</surname>
          </string-name>
          .
          <article-title>Towards zero-overhead static and adaptive indexing in Hadoop</article-title>
          . VLDBJ,
          <volume>23</volume>
          (
          <issue>3</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alagiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Liarou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Michiardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Vukolic</surname>
          </string-name>
          .
          <article-title>Dinodb: An Interactive-speed Query Engine for Ad-hoc Queries on Temporary Data</article-title>
          .
          <source>IEEE Transactions on Big Data</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Scheidegger</surname>
          </string-name>
          .
          <article-title>Gaussian cubes: Real-time modeling for visual exploration of large multidimensional datasets</article-title>
          .
          <source>IEEE TVCG</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wasay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dayan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          .
          <article-title>Data Canopy: Accelerating Exploratory Statistical Analysis</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yesilmurat</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Isler</surname>
          </string-name>
          .
          <article-title>Retrospective adaptive prefetching for interactive Web GIS applications</article-title>
          .
          <source>GeoInformatica</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>