=Paper= {{Paper |id=Vol-3075/paper5 |storemode=property |title=Progressive Indexing for Interactive Analytics |pdfUrl=https://ceur-ws.org/Vol-3075/paper5.pdf |volume=Vol-3075 |authors=Michael Hohenstein |dblpUrl=https://dblp.org/rec/conf/gvd/Hohenstein21 }} ==Progressive Indexing for Interactive Analytics== https://ceur-ws.org/Vol-3075/paper5.pdf

Progressive Indexing for Interactive Analytics

Michael Hohenstein
Technical University of Kaiserslautern
Erwin-Schrödinger-Straße, 67663 Kaiserslautern
Kaiserslautern, Germany
hohenstein@cs.uni-kl.de

ABSTRACT algorithms and have investigated potentials and future chal-
Progressive Visual Analytics (PVA) is a recent paradigm in lenges of the paradigm.
the realm of visualization. PVA is closely related to appro- Turkay et al. have discussed potentials and challenges of a
ximate query processing and streaming applications with a progressive data science approach in the field of Machine
focus on real-time interactivity and features supporting a Learning, Data Base Systems and Visualization [18]. They
fluid and insightful experience for data analysts working on identify the following relevant areas of (recent) research in
unknown data sources in an exploratory way. We want to ob- the DB community:
serve the paradigm of progressive data science through the • Approximate Query Processing
lens of database systems, more specifically considering exis-
ting technologies which can be exploited to support progres- • re-use of (partial) results and sampling that takes into
sive systems. In this paper we briefly investigate the simila- account rare subpopulations
rities of PVA to approximate query processing and present
an indexing strategy which can be utilized in approxima- • self-adapting (data organization, indexing, etc.) data
te and progressive environments without conflicting with its structures
intrinsic requirements.
• speculative query processing

Keywords • (progressive) data wrangling and cleaning
progressive visual analytics, approximate query processing, In this paper, we want to take a more detailed look at
indexing, sampling the application of progressive indexing [13] in the context of
PVA. In particular, we will focus on a self-adapting index,
touching the areas of approximate querying and sampling
1. INTRODUCTION to support the requirements of progressive analysis tasks.
Progressive Visual Analytics (PVA) [4, 17] is a relatively The main contributions of the paper is to demonstrate the
new paradigm in the realm of visualization. It’s main objec- synergy between approximate (and by similarity of the con-
tive is to develop algorithms and infrastructure to support cepts progressive) querying approaches and progressive in-
analysts in exploratory ad hoc data analysis. This means dexing. Additionally, we present a sampling and indexing
each query should return an (approximate) result within an strategy useful in realizing systems for progressive query en-
upper time bound, so that data exploration can be consi- vironments. The outline of our paper is as follows: First we
dered a real-time process. Additionally, the analyst should want to give an overview about the paradigm of PVA and
be able to steer the query by tuning parameters of the com- take a look at which existing data base technologies can be
putation. The key principle of the progressive paradigm is utilized to support the progressive paradigm. In sectiion 3
to instantly return an approximate result which is (progres- we present our approach of combining a strategy involving
sively) updated in the background. Ideally some notion of approximate query processing and construction of a progres-
(partial) re-use of earlier results that intersect with a live sive index to support aspects of ”progressiveness”regarding
query should be in place to keep a certain continuity in a fast, reliable intermediate results. We then present an eva-
computation and further reduce the response time of later luation of our efforts and in section 4. We close the paper
queries. Many recent papers on PVA ([3, 7, 8, 10]) have with a conclusion and outlook on future work.
put the focus on features and requirements of progressive
2. RELATED WORK AND BACKGROUND
The iterative nature of the progressive computation can
be found in a number of already existing computational con-
cepts. The most similar concepts that come to mind are
Streaming, Online, and Iterative algorithms. In the realm of
data base applications, the concept of Online Aggregation is
32nd GI-Workshop on Foundations of Databases (Grundlagen von Daten- closely related [9, 12]. Essentially, the field of PVA does not
Copyright ©
banken), September 01-03, 2021, Munich, Germany.
2021 for this paper by its authors. Use permitted under Crea-
tive Commons License Attribution 4.0 International (CC BY 4.0).
introduce a completely new computational paradigm, but
describes a set of features a computational process should
fulfill to to be classified as progressive, Stolper, Aupetit, and
Fekete et al. [7] comprised a definition of features a progres-
sive computation must possess. The computation should:
1. Be bounded on time and data.

2. Report intermediate outputs: a result, and measures
of quality and progress.

3. Be bounded on latency. Figure 1: Original Indexing Approach of Holanda et al. The
entries are all tuples which will be returned for a running
4. Converge towards the true result.
query. [13]
5. Be controllable by a user during execution.
From the perspective of data base systems, this leads to On a basic level, the indexer by Holanda et al. takes a frac-
the following ”progressive”requirements for a system or in- tion δ of the data of and adds them to a growing index (see
frastructure designed for analytical components that adhere Figure 1). The cost of index creation is ßmeared outöver the
to the design goals mentioned above. run time of queries issued to the data base. For a standard
query environment, this means that initially queries will ta-
1. Suitable sampling algorithms with outlier detection. ke rather long, as early inquiries to the data base can not
benefit from indexes and have to scan all tuples sequentially.
2. Fast approximate algorithms.

3. Efficient (partial) reuse to bring continuity into a stee- More precisely, Holanda’s approach is divided into three
red computation. phases. In the Creation Phase a (standard, non approxima-
te) query first performs an index lookup on the fraction of
4. Metrics on quality and progression of the computation. indexed data (ρ). Now, the not-yet-indexed 1 − ρ fraction of
the original column is scanned while expanding the index by
5. Guarantees on user given time-bounds to the first re- a chosen fraction δ of the column size. For the index creati-
sult and to new results. on Holanda et al. use one of four sorting algorithms: Quick-
sort, Bucketsort, Radixsort (MSD), and Radixsort (LSD).
6. Supporting structures to aid data analysts (Indexes,
Depending on the utilized Indexing method, the partial in-
Meta Data, etc.).
dex performs better for different querying scenarios. Once
In the context of this paper, we focus on requirements all data is indexed the approach enters the Refinement Pha-
1, 2, and 6: Sampling with outlier detection, approximate se. There, queries can be performed without scanning any
querying approaches, and indexing techniques to support non-indexed data. Additionally, the existing, rough index is
progressive exploration tasks. More specifically, we propo- refined, progressively converging towards a fully ordered in-
se a progressively built index which is constructed while a dex. Lastly, in the Consolidation Phase the ordered index
user poses fast approximate queries in a real time fashion. is now progressively converted to a B+ tree for query effi-
By real time we mean, that queries should return a result ciency. For an example, please refer to Figure 2 for a short
below the interactivity threshold of 500ms, as proposed by overview of the quicksort indexing appproach of Holanda et
Liu et al [15]. For our approach, we investigated the work al.
of Holanda et al. [13], which describes an index which is While useful in an exact, non approximate querying set-
built progressively while querying a data base, and aim to ting, Holanda states that progressive indexing techniques
integrate this indexing technique with Approximative Query most likely will synergize well with approximate querying
Processing (AQP). Generally, there are two major categories techniques, since the problem of initially longer running que-
of AQP approaches. online aggregation [12] and sampling [1]. ries can be remedied by settling for faster, approximate re-
Sampling approaches (e.g. BlinkDB [2], AQUA [16]) requi- sults before an index structure exists.
re preprocessing to deliver statistically significant results on
lower populations, which clashes with the dynamic nature 3. PROGRESSIVE INDEXING IN THE CON-
of exploratory data analysis. Online aggregation (e.g. CON- TEXT OF APPROXIMATE QUERY PRO-
TROL [11], DBO [14]) struggles to give reliable approxima-
tions for rare subpopulations of a large data set. To remedy CESSING.
these weaknesses, we studied the work of Galakatos et al. The premise of progressive indexing is to iteratively crea-
[9] who employs uniform sampling in tandem with rare sub- te a database index on a previously unknown data sour-
population detection to amount for data skew and outliers. ce and ßmear out”the cost of doing so over several running
We use a similar approach, but combine it with parallel pre- queries. We want to explore if progressive indexing can be
processing to collect distributional metadata of the data set used alongside and/or benefit from a progressive querying
to be explored in order to improve later samples. We bridge approach. Since building a complete index can take a long
the non-interactive preprocessing gap with uniform samp- time, doing so is kind of antithetical for an ad-hoc explora-
ling (which requires no initial preprocessing), refined with tory query session where fast (approximate) results on an
rare subpopulation detection. As such, we are able to quick- unexplored and unprepared data source are the focus. The
ly answer queries with rare populations and data skew taken progressive indexing will incrementally build an index in par-
into account, while samples taken later accurately represent allel to interactive query sessions, speeding up certain que-
the real data. ries, and also preparing the data base for queries requiring an
Figure 3: Flowchart of the interaction between query engine
Figure 2: Example of progressive Indexing using Quicksort
and sampler.
showing the Creation, Refinement, and Consolidation phases.
[13]

exact result. The synergy with approximate querying largely
stems from the fact, that approximate / progressive queries
don’t have to scan the complete data base for a query which
is fired against a data base with no index whatsoever. If only
a subset of tuples is retrieved initially, the indexing can be
done in a more dynamic manner alongside of a fluid ”real ti-
meüser interaction from the start. To explore this venue, we
built a prototype comparing the suitability of various samp- Figure 4: Modified Indexing Approach of Holanda et al. To
ling / indexing techniques based on Holanda’s work in the account for Approximate Querying Techniques.
scope ([19]) of approximate querying applications. Rough-
ly explained, our approach first enters a preparatory stage
in which facilities for the progressive index are created and a subset of the complete data set and does not have to scan
meta-data to ensure statistically significant samples is col- the complete data set.
lected. After this initial stage the partial index is in place Approximate results require a random sample to be sta-
and populated in parallel to approximate querying sessions. tistically relevant. Taking just the first δ rows as a sample
To keep the system interactive, during the preparatory sta- like in the original implementation would not return good
ge queries can be answered in parallel, albeit not with data results. As Galakatos et al. point out [6], we require some
skew or outliers taken into account. initial knowledge about a data set to be able to reach some
The following features are our main contributions which modicum of statistical significance with standard sampling
differentiate our approach to the work of Holanda et al. First, methods (like the total row count for example). To solve
we keep an absolute number s of tuples instead of a certain this dilemma, we utilize Reservoir Sampling [20] to create
fraction δ of the complete data set to determine how many the first sample. Since reservoir sampling misses any skew
tuples to index, since we have no clue how many rows the or outliers in the data, we combine the Reservoir Sampling
data set to be inspected has. Also, not only do we partially technique with preprocessing of the data to be streamed to
index the complete set per query, we also employ approxima- make the following samples more representative of the real
te query processing and trade accuracy of results for speed data set. We count the number of rows of the data set as and
concerning the answering of queries during all stages of que- record the count of attribute groups to get meta data about
rying. In the same vein, we supplement our sampling approa- a columns distribution and the replacement probability for
ches with rare population indexing to improve the statistical each row, as shown in Algorithm 1. While a preprocessing
significance of samples taken from the original data. step might seem to go against to the requirements of the
For a high-level overview of the architecture of the query progressive paradigm, we deem it acceptable as queries can
and sampling engine, please refer to Figure 3. already be answered during the preprocessing, albeit with
less accurate results.
3.1 Initial Sampling and Preprocessing
Since in an approximate environment the cardinality of 3.1.1 Rare Population Indexing
the complete data base is unknown, it is impossible to just Together with the reservoir sample, we build another sam-
determine some fraction δ of the complete data set which ple that takes rare attribute values into account. To do this,
should be indexed per query. Since for our case we always we designate a certain fraction θ as a threshold for a value
take (increasing) samples of the complete data, we index s to count as rare. For this we use the current row count and
many tuples, where s is the size of the sample (See Figure 4). attribute distribution gained during the reservoir preproces-
A nice side effect of this approach is, that an initial partial sing. Any attribute appearing for a fraction of less than θ
index is available much earlier, as the query only processes of the row count will be sampled separately. Since the row
count increases during iteration, we have to continuously
check if we still consider an attribute rare. Once an attribu-
te is deemed rare, it will be removed from the rare sample
pool if it appears more often than (θ ∗2)∗current rowcount.
Using this method will provide sufficient entries in the sam-
ple for rare groups and thus more interesting results, even in
the early phases of sampling. The rare samples however will
be discarded when the system switches to static sampling to
guarantee the correct distribution and proper randomness
(See Algorithm 1).

Algorithm 1: Modified Reservoir Sampling
Input: Data stream to record distribution from.
Output: Random uniform sample of dataset, random
Figure 5: Creation of the Adjusted Stratified Sample based
samples from rare populations.
on the distribution of the age column
1 rowcount ← 0;
2 rareThreshold ← 0;
3 groupCounts ← initialize dictionary;
4 sample ← initialize sample; tial index tree is created. For this we have implemented the
5 for each row in datastream do Progressive Radixsort (MSD) method, similar to Holanda
¯ ¯ et al. [13]. From the initial preprocessing, we create a B+
6 groupCounts ← occurenec of att. value;
7 rowCount += 1; tree index structure for each column in accordance of the
8 if length of sample < max. sample size then amount of different attribute instances we observed. Into
9 add row to sample; this index tree the original reservoir sample is then inserted.
10 else Now, this index tree will bee populated with more samples
11 replacement prob. ← sample size / rowcount; over time, utilizing any down-time during a user session. As
12 if value gets replaced then mentioned earlier in this section, we collected metadata du-
13 replacedRow ← rand. row in curr. sample; ring the initial sampling step about the distribution of the
14 replace row with new row; data. Thus, all future samples which are used to populate
15 if replacedRow is rare then the index structure can be chosen so that they reflect the
16 resample replaced row in separate distribution of the original data using a stratified sampling
sample [5] approach. Data samples are thus pulled randomly from
17 end their respective groups and not just randomly overall. Once
18 else an (approximate) query is issued, the progressive indexing
19 if row is a rare entry then is paused, and the query is only answered by tuples already
20 resample row in separate sample indexed. If an extreme skew is present, it can easily happen
21 end that samples would include less than one entry to reflect the
22 end correct distribution. We opt to include at least one of the-
23 end se very rare datapoints with each sample as illustrated in
24 rareThreshold ← rowCount * θ; Figure 5, since outliers often are a point of interest for ex-
25 end
ploratory data analysis. The distribution in the index is thus
slightly off in the beginning, but since the implementation
keeps pulling samples continuously, this error will correct
itself quickly over time. This also guarantees that enough
3.2 Early Query Executions rare entries are available early to create meaningful appro-
Early queries encompass all queries that are executed be- ximations. When the samples are generated, they are pul-
fore any index is built. The set of tuples these queries are led and indexed sequentially until the data is exhausted. At
executed on is a combination of the existing random uni- this point, the index is fully built and all future queries will
form reservoir sample and parts of the random samples for run with the full index. Note, that while later approxima-
rare populations. If the tuples of the reservoir sample con- te queries take longer due to the ever increasing number of
tain too few of any rare attribute value, tuples contained stratified tuples entered into the index, the gain in result
in the separate index of rare populations will be randomly accuracy is usually worth it. While Holanda’s original ap-
added so that they are not overlooked. This will produce proach only refines the coarse index into a B+ tree after all
more reliable results than straight reservoir sampling, even data has been indexed, we chose to refine the index immedia-
if they are somewhat biased. Mind this mode of querying is tely after indexing a new fragment of the original data. The
only temporary, as it is only included to bridge the time nee- small overhead of immediate refinement to a full B+ tree
ded to preprocess the data set and create the initial index allows for much faster queries than on a coarse index. The-
structure which incorporates a stratified sampling approach re queries would quickly reach undesirable delays if issued
to take into account data skew and rare populations. over large indexes. If the index becomes too large that que-
rying it completely becomes too costly, we can resort either
3.3 Index Generation to querying only a statistically relevant sample of the index
After having created an initial random sample and a full or to pruning the index to a subset of the original data. To
pass over the data to collect distributional metadata, a par- keep these approximate queries or pruned indexes statisti-
Figure 6: Distribution of the age value of the test data set. Figure 7: Distribution of the salary value of the test data set.

cally relevant, we reference the distributional metadata to
create an appropriate approximation of the original data. It
should be obvious, that at this point, the index structure is
not only useful for approximate queries, but will also serve
in scenarios where exact results are required.

4. EVALUATION
Since it is not really interesting if the queries themselves
are faster for an approximate setting instead of a contempo-
rary setting, we will focus on comparing the time until the
progressive indexing converges to a full query compared to
the approach of Holanda et al., and how well the stratified
index manages to account for data skew while uncompleted.

4.1 Setup
For the evaluation, we have implemented the system in
Python version 3.8.3. The experiments were run on a ma-
chine with an Intel 6700k processor at 4.4GHz and 32GB of Figure 8: Duration for the B+ index reaching full convergence
DDR4 RAM at 2133MHz. while repeatedly issuing the simple query (age ≥ 32).

4.2 Data
We have generated a dataset with five million rows contai- seconds to a fully converted index, while our approach only
ning data about the age, salary and received tips of waiters. takes 18 seconds. We thus reach index convergence roughly
To test the indexing prototype, we made sure to integrate 22 times faster than the original approach. This is mostly
different value distributions and amount of skew into the due to the fact that the original approach relies on queries
data. The tip tuples are uniformly distributed. The distri- always ranging over the complete data base to further gene-
bution of the age and salary values (see Figures 6 and 7) is rate the index. As a result, the index cannot be constructed
each skewed towards a particular value range. further until the current query has finished processing all
tuples. Our solution on the other hand utilizes the much
4.3 Time until Index Convergence shorter time to return of approximate queries to be able to
For this experiment, we implemented progressive indexing quickly construct an index.
in a similar fashion as Holanda et al. demonstrated in their
approach (specifically the Radix MSD variation in its base 4.4 Accounting for Data Skew
form) and ran it on the same data set. Only one index on the In this experiment, we looked at how well our sampling ap-
‘age’ column was constructed. Each second we repeatedly is- proach can handle data skew. For this, we ran a count(’age’)
sued a simple query (age ≥ 32) until the index converged. query run once with an initial reservoir sample generated in
We did not implement the adaptive indexing budget pro- the first step before a stratified index is built, and after-
posed by Holanda et al., but instead set the to-be-indexed wards on a reservoir sample including rare subpopulation
fraction with each query to 10%. The sample size for our sampling. The sample size for this test was 5000. Figure 9
version was again set to 5000. In Figure 8 we can see that shows the pitfall of standard reservoir sampling. Since the
a progressive indexing approach greatly benefits from ap- values 25 and 26 represent the bulk of the entries, other ent-
proximate querying techniques. The base version takes 404 ries are easily missed. Especially the rare outliers are excee-
get was used to automatically adjust how much of the query
time should be used for indexing. We could imagine a simi-
lar solution for automatically determining the sample size for
the progressive index in an approximate query environment.
For the stratified sampling, we chose a very simple solution
to determine how the samples should be generated by just
relying on the skew of one column. In a real environment,
this might not be viable since other columns might be just
as interesting. Creating a more refined scheme to account
for skew found in several attributes would further increase
the reliability of the partial results returned by approximate
queries with our proposed indexing strategy.
Figure 9: Distribution of a sample of salary values taken with
reservoir sampling.
6. REFERENCES
[1] S. Acharya, P. B. Gibbons, and V. Poosala.
Congressional samples for approximate answering of
group-by queries. In W. Chen, J. F. Naughton, and
P. A. Bernstein, editors, Proceedings of the 2000 ACM
SIGMOD International Conference on Management of
Data, May 16-18, 2000, Dallas, Texas, USA, pages
487–498. ACM, 2000.
[2] S. Agarwal, A. Panda, B. Mozafari, S. Madden, and
I. Stoica. Blinkdb: Queries with bounded errors and
bounded response times on very large data. CoRR,
abs/1203.5485, 2012.
[3] M. Angelini, T. May, G. Santucci, and H.-J. Schulz.
Figure 10: Distribution of a sample of salary values taken with On quality indicators for progressive visual analytics.
stratified sampling. 06 2019.
[4] M. Angelini, G. Santucci, H. Schumann, and
H. Schulz. A review and characterization of
dingly unlikely to be included in the initial sample. Figure progressive visual analytics. Informatics, 5(3):31, 2018.
10 shows the impact rare population sampling strategy can [5] M. P. Cohen. Stratified sampling. In M. Lovric, editor,
have on a sample of the same size. Even exceedingly rare International Encyclopedia of Statistical Science,
outliers (age 30-36) are included in the sample. While not pages 1547–1550. Springer, 2011.
correct, this sample composition is a much better represen- [6] K. Dursun, C. Binnig, U. Çetintemel, and T. Kraska.
tation of the real data set. Here we can see the advantage Revisiting reuse in main memory database systems.
of including rare population sampling in addition to reser- CoRR, abs/1608.05678, 2016.
voir sampling until we have a true stratified sample. When [7] J. Fekete, D. Fisher, A. Nandi, and M. Sedlmair.
the stratified index structure is built later, data skew largely Progressive data analysis and visualization (dagstuhl
stops being a problem as we now know the distribution of seminar 18411). Dagstuhl Reports, 8(10):1–40, 2018.
values. As shown in Figure 5, the distribution is not exact [8] J. Fekete and R. Primet. Progressive analytics: A
simply because we can not include fractions of tuples, but computation paradigm for exploratory data analysis.
good enough to give a quite accurate picture of the complete CoRR, abs/1607.05162, 2016.
data. [9] A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, and
T. Kraska. Revisiting reuse for approximate query
5. CONCLUSION AND OUTLOOK processing. PVLDB, 10(10):1142–1153, 2017.
For this work we prototyped a progressive indexing ap- [10] A. Gogolou, T. Tsandilas, T. Palpanas, and
proach usable in an approximate querying environment in- A. Bezerianos. Progressive similarity search on time
spired by a Progressive Indexing approach by Holanda et series data. In Proceedings of the Workshops of the
al. [13]. The AQP environment leads to significantly lower EDBT/ICDT 2019 Joint Conference, EDBT/ICDT
query execution times, even when the data is not fully in- 2019, Lisbon, Portugal, March 26, 2019, 2019.
dexed. To make these approximate results more significant, [11] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber,
we combined reservoir sampling, meta data collection (rare C. Olston, V. Raman, T. Roth, and P. J. Haas.
population sampling) and a stratified sampling approach to Interactive data analysis: The control project.
remedy for the intrinsic inaccuracy of approximate queries. Computer, 32(8):51–59, 1999.
Our evaluation results confirm Holanda’s assumption that [12] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online
the progressive indexing approach benefits greatly from the aggregation. In J. Peckham, editor, SIGMOD 1997,
much faster time to return of approximate queries on non- Proceedings ACM SIGMOD International Conference
indexed data. For future work on this topic, we identified two on Management of Data, May 13-15, 1997, Tucson,
major areas of potential improvements: Including an inde- Arizona, USA, pages 171–182. ACM Press, 1997.
xing budget, and expanding the stratified sampling strategy. [13] P. Holanda, S. Manegold, H. Mühleisen, and
In the original approach by Holanda et al., an indexing bud- M. Raasveldt. Progressive indexes: Indexing for
interactive data analysis. PVLDB, 12(13):2366–2378, [17] C. D. Stolper, A. Perer, and D. Gotz. Progressive
2019. visual analytics: User-driven visual exploration of
[14] C. M. Jermaine, S. Arumugam, A. Pol, and A. Dobra. in-progress analytics. IEEE Transactions on
Scalable approximate query processing with the DBO Visualization and Computer Graphics,
engine. In C. Y. Chan, B. C. Ooi, and A. Zhou, 20(12):1653–1662, 2014.
editors, Proceedings of the ACM SIGMOD [18] C. Turkay, N. Pezzotti, C. Binnig, H. Strobelt,
International Conference on Management of Data, B. Hammer, D. A. Keim, J. Fekete, T. Palpanas,
Beijing, China, June 12-14, 2007, pages 725–736. Y. Wang, and F. Rusu. Progressive data science:
ACM, 2007. Potential and challenges. CoRR, abs/1812.08032, 2018.
[15] Z. Liu and J. Heer. The effects of interactive latency [19] M. van den Berg. Applying progressive indexing in the
on exploratory visual analysis. IEEE Trans. Vis. context of approximate query processing. 2021.
Comput. Graph., 20(12):2122–2131, 2014. [20] J. S. Vitter. Random sampling with a reservoir. ACM
[16] T. K. Sellis. Review - the aqua approximate query Trans. Math. Softw., 11(1):37–57, 1985.
answering system. ACM SIGMOD Digit. Rev., 2, 2000.