=Paper= {{Paper |id=Vol-3075/paper5 |storemode=property |title=Progressive Indexing for Interactive Analytics |pdfUrl=https://ceur-ws.org/Vol-3075/paper5.pdf |volume=Vol-3075 |authors=Michael Hohenstein |dblpUrl=https://dblp.org/rec/conf/gvd/Hohenstein21 }} ==Progressive Indexing for Interactive Analytics== https://ceur-ws.org/Vol-3075/paper5.pdf
                 Progressive Indexing for Interactive Analytics

                                                              Michael Hohenstein
                                                  Technical University of Kaiserslautern
                                             Erwin-Schrödinger-Straße, 67663 Kaiserslautern
                                                        Kaiserslautern, Germany
                                                          hohenstein@cs.uni-kl.de


ABSTRACT                                                                      algorithms and have investigated potentials and future chal-
Progressive Visual Analytics (PVA) is a recent paradigm in                    lenges of the paradigm.
the realm of visualization. PVA is closely related to appro-                  Turkay et al. have discussed potentials and challenges of a
ximate query processing and streaming applications with a                     progressive data science approach in the field of Machine
focus on real-time interactivity and features supporting a                    Learning, Data Base Systems and Visualization [18]. They
fluid and insightful experience for data analysts working on                  identify the following relevant areas of (recent) research in
unknown data sources in an exploratory way. We want to ob-                    the DB community:
serve the paradigm of progressive data science through the                         • Approximate Query Processing
lens of database systems, more specifically considering exis-
ting technologies which can be exploited to support progres-                       • re-use of (partial) results and sampling that takes into
sive systems. In this paper we briefly investigate the simila-                       account rare subpopulations
rities of PVA to approximate query processing and present
an indexing strategy which can be utilized in approxima-                           • self-adapting (data organization, indexing, etc.) data
te and progressive environments without conflicting with its                         structures
intrinsic requirements.
                                                                                   • speculative query processing

Keywords                                                                           • (progressive) data wrangling and cleaning
progressive visual analytics, approximate query processing,                      In this paper, we want to take a more detailed look at
indexing, sampling                                                            the application of progressive indexing [13] in the context of
                                                                              PVA. In particular, we will focus on a self-adapting index,
                                                                              touching the areas of approximate querying and sampling
1.    INTRODUCTION                                                            to support the requirements of progressive analysis tasks.
   Progressive Visual Analytics (PVA) [4, 17] is a relatively                 The main contributions of the paper is to demonstrate the
new paradigm in the realm of visualization. It’s main objec-                  synergy between approximate (and by similarity of the con-
tive is to develop algorithms and infrastructure to support                   cepts progressive) querying approaches and progressive in-
analysts in exploratory ad hoc data analysis. This means                      dexing. Additionally, we present a sampling and indexing
each query should return an (approximate) result within an                    strategy useful in realizing systems for progressive query en-
upper time bound, so that data exploration can be consi-                      vironments. The outline of our paper is as follows: First we
dered a real-time process. Additionally, the analyst should                   want to give an overview about the paradigm of PVA and
be able to steer the query by tuning parameters of the com-                   take a look at which existing data base technologies can be
putation. The key principle of the progressive paradigm is                    utilized to support the progressive paradigm. In sectiion 3
to instantly return an approximate result which is (progres-                  we present our approach of combining a strategy involving
sively) updated in the background. Ideally some notion of                     approximate query processing and construction of a progres-
(partial) re-use of earlier results that intersect with a live                sive index to support aspects of ”progressiveness”regarding
query should be in place to keep a certain continuity in a                    fast, reliable intermediate results. We then present an eva-
computation and further reduce the response time of later                     luation of our efforts and in section 4. We close the paper
queries. Many recent papers on PVA ([3, 7, 8, 10]) have                       with a conclusion and outlook on future work.
put the focus on features and requirements of progressive
                                                                              2.    RELATED WORK AND BACKGROUND
                                                                                 The iterative nature of the progressive computation can
                                                                              be found in a number of already existing computational con-
                                                                              cepts. The most similar concepts that come to mind are
                                                                              Streaming, Online, and Iterative algorithms. In the realm of
                                                                              data base applications, the concept of Online Aggregation is
32nd GI-Workshop on Foundations of Databases (Grundlagen von Daten-           closely related [9, 12]. Essentially, the field of PVA does not
Copyright ©
banken), September 01-03, 2021, Munich, Germany.
              2021 for this paper by its authors. Use permitted under Crea-
tive Commons License Attribution 4.0 International (CC BY 4.0).
                                                                              introduce a completely new computational paradigm, but
                                                                              describes a set of features a computational process should
fulfill to to be classified as progressive, Stolper, Aupetit, and
Fekete et al. [7] comprised a definition of features a progres-
sive computation must possess. The computation should:
  1. Be bounded on time and data.

  2. Report intermediate outputs: a result, and measures
     of quality and progress.

  3. Be bounded on latency.                                         Figure 1: Original Indexing Approach of Holanda et al. The
                                                                    entries are all tuples which will be returned for a running
  4. Converge towards the true result.
                                                                    query. [13]
  5. Be controllable by a user during execution.
   From the perspective of data base systems, this leads to            On a basic level, the indexer by Holanda et al. takes a frac-
the following ”progressive”requirements for a system or in-         tion δ of the data of and adds them to a growing index (see
frastructure designed for analytical components that adhere         Figure 1). The cost of index creation is ßmeared outöver the
to the design goals mentioned above.                                run time of queries issued to the data base. For a standard
                                                                    query environment, this means that initially queries will ta-
  1. Suitable sampling algorithms with outlier detection.           ke rather long, as early inquiries to the data base can not
                                                                    benefit from indexes and have to scan all tuples sequentially.
  2. Fast approximate algorithms.

  3. Efficient (partial) reuse to bring continuity into a stee-        More precisely, Holanda’s approach is divided into three
     red computation.                                               phases. In the Creation Phase a (standard, non approxima-
                                                                    te) query first performs an index lookup on the fraction of
  4. Metrics on quality and progression of the computation.         indexed data (ρ). Now, the not-yet-indexed 1 − ρ fraction of
                                                                    the original column is scanned while expanding the index by
  5. Guarantees on user given time-bounds to the first re-          a chosen fraction δ of the column size. For the index creati-
     sult and to new results.                                       on Holanda et al. use one of four sorting algorithms: Quick-
                                                                    sort, Bucketsort, Radixsort (MSD), and Radixsort (LSD).
  6. Supporting structures to aid data analysts (Indexes,
                                                                    Depending on the utilized Indexing method, the partial in-
     Meta Data, etc.).
                                                                    dex performs better for different querying scenarios. Once
   In the context of this paper, we focus on requirements           all data is indexed the approach enters the Refinement Pha-
1, 2, and 6: Sampling with outlier detection, approximate           se. There, queries can be performed without scanning any
querying approaches, and indexing techniques to support             non-indexed data. Additionally, the existing, rough index is
progressive exploration tasks. More specifically, we propo-         refined, progressively converging towards a fully ordered in-
se a progressively built index which is constructed while a         dex. Lastly, in the Consolidation Phase the ordered index
user poses fast approximate queries in a real time fashion.         is now progressively converted to a B+ tree for query effi-
By real time we mean, that queries should return a result           ciency. For an example, please refer to Figure 2 for a short
below the interactivity threshold of 500ms, as proposed by          overview of the quicksort indexing appproach of Holanda et
Liu et al [15]. For our approach, we investigated the work          al.
of Holanda et al. [13], which describes an index which is              While useful in an exact, non approximate querying set-
built progressively while querying a data base, and aim to          ting, Holanda states that progressive indexing techniques
integrate this indexing technique with Approximative Query          most likely will synergize well with approximate querying
Processing (AQP). Generally, there are two major categories         techniques, since the problem of initially longer running que-
of AQP approaches. online aggregation [12] and sampling [1].        ries can be remedied by settling for faster, approximate re-
Sampling approaches (e.g. BlinkDB [2], AQUA [16]) requi-            sults before an index structure exists.
re preprocessing to deliver statistically significant results on
lower populations, which clashes with the dynamic nature            3.   PROGRESSIVE INDEXING IN THE CON-
of exploratory data analysis. Online aggregation (e.g. CON-              TEXT OF APPROXIMATE QUERY PRO-
TROL [11], DBO [14]) struggles to give reliable approxima-
tions for rare subpopulations of a large data set. To remedy             CESSING.
these weaknesses, we studied the work of Galakatos et al.              The premise of progressive indexing is to iteratively crea-
[9] who employs uniform sampling in tandem with rare sub-           te a database index on a previously unknown data sour-
population detection to amount for data skew and outliers.          ce and ßmear out”the cost of doing so over several running
We use a similar approach, but combine it with parallel pre-        queries. We want to explore if progressive indexing can be
processing to collect distributional metadata of the data set       used alongside and/or benefit from a progressive querying
to be explored in order to improve later samples. We bridge         approach. Since building a complete index can take a long
the non-interactive preprocessing gap with uniform samp-            time, doing so is kind of antithetical for an ad-hoc explora-
ling (which requires no initial preprocessing), refined with        tory query session where fast (approximate) results on an
rare subpopulation detection. As such, we are able to quick-        unexplored and unprepared data source are the focus. The
ly answer queries with rare populations and data skew taken         progressive indexing will incrementally build an index in par-
into account, while samples taken later accurately represent        allel to interactive query sessions, speeding up certain que-
the real data.                                                      ries, and also preparing the data base for queries requiring an
                                                                  Figure 3: Flowchart of the interaction between query engine
Figure 2: Example of progressive Indexing using Quicksort
                                                                  and sampler.
showing the Creation, Refinement, and Consolidation phases.
[13]


exact result. The synergy with approximate querying largely
stems from the fact, that approximate / progressive queries
don’t have to scan the complete data base for a query which
is fired against a data base with no index whatsoever. If only
a subset of tuples is retrieved initially, the indexing can be
done in a more dynamic manner alongside of a fluid ”real ti-
meüser interaction from the start. To explore this venue, we
built a prototype comparing the suitability of various samp-      Figure 4: Modified Indexing Approach of Holanda et al. To
ling / indexing techniques based on Holanda’s work in the         account for Approximate Querying Techniques.
scope ([19]) of approximate querying applications. Rough-
ly explained, our approach first enters a preparatory stage
in which facilities for the progressive index are created and     a subset of the complete data set and does not have to scan
meta-data to ensure statistically significant samples is col-     the complete data set.
lected. After this initial stage the partial index is in place       Approximate results require a random sample to be sta-
and populated in parallel to approximate querying sessions.       tistically relevant. Taking just the first δ rows as a sample
To keep the system interactive, during the preparatory sta-       like in the original implementation would not return good
ge queries can be answered in parallel, albeit not with data      results. As Galakatos et al. point out [6], we require some
skew or outliers taken into account.                              initial knowledge about a data set to be able to reach some
   The following features are our main contributions which        modicum of statistical significance with standard sampling
differentiate our approach to the work of Holanda et al. First,   methods (like the total row count for example). To solve
we keep an absolute number s of tuples instead of a certain       this dilemma, we utilize Reservoir Sampling [20] to create
fraction δ of the complete data set to determine how many         the first sample. Since reservoir sampling misses any skew
tuples to index, since we have no clue how many rows the          or outliers in the data, we combine the Reservoir Sampling
data set to be inspected has. Also, not only do we partially      technique with preprocessing of the data to be streamed to
index the complete set per query, we also employ approxima-       make the following samples more representative of the real
te query processing and trade accuracy of results for speed       data set. We count the number of rows of the data set as and
concerning the answering of queries during all stages of que-     record the count of attribute groups to get meta data about
rying. In the same vein, we supplement our sampling approa-       a columns distribution and the replacement probability for
ches with rare population indexing to improve the statistical     each row, as shown in Algorithm 1. While a preprocessing
significance of samples taken from the original data.             step might seem to go against to the requirements of the
   For a high-level overview of the architecture of the query     progressive paradigm, we deem it acceptable as queries can
and sampling engine, please refer to Figure 3.                    already be answered during the preprocessing, albeit with
                                                                  less accurate results.
3.1    Initial Sampling and Preprocessing
  Since in an approximate environment the cardinality of          3.1.1    Rare Population Indexing
the complete data base is unknown, it is impossible to just          Together with the reservoir sample, we build another sam-
determine some fraction δ of the complete data set which          ple that takes rare attribute values into account. To do this,
should be indexed per query. Since for our case we always         we designate a certain fraction θ as a threshold for a value
take (increasing) samples of the complete data, we index s        to count as rare. For this we use the current row count and
many tuples, where s is the size of the sample (See Figure 4).    attribute distribution gained during the reservoir preproces-
A nice side effect of this approach is, that an initial partial   sing. Any attribute appearing for a fraction of less than θ
index is available much earlier, as the query only processes      of the row count will be sampled separately. Since the row
count increases during iteration, we have to continuously
check if we still consider an attribute rare. Once an attribu-
te is deemed rare, it will be removed from the rare sample
pool if it appears more often than (θ ∗2)∗current rowcount.
Using this method will provide sufficient entries in the sam-
ple for rare groups and thus more interesting results, even in
the early phases of sampling. The rare samples however will
be discarded when the system switches to static sampling to
guarantee the correct distribution and proper randomness
(See Algorithm 1).

   Algorithm 1: Modified Reservoir Sampling
    Input: Data stream to record distribution from.
    Output: Random uniform sample of dataset, random
                                                                 Figure 5: Creation of the Adjusted Stratified Sample based
              samples from rare populations.
                                                                 on the distribution of the age column
 1 rowcount ← 0;
 2 rareThreshold ← 0;
 3 groupCounts ← initialize dictionary;
 4 sample ← initialize sample;                                   tial index tree is created. For this we have implemented the
 5 for each row in datastream do                                 Progressive Radixsort (MSD) method, similar to Holanda
         ¯         ¯                                             et al. [13]. From the initial preprocessing, we create a B+
  6    groupCounts ← occurenec of att. value;
  7    rowCount += 1;                                            tree index structure for each column in accordance of the
  8    if length of sample < max. sample size then               amount of different attribute instances we observed. Into
  9         add row to sample;                                   this index tree the original reservoir sample is then inserted.
10     else                                                      Now, this index tree will bee populated with more samples
11          replacement prob. ← sample size / rowcount;          over time, utilizing any down-time during a user session. As
12          if value gets replaced then                          mentioned earlier in this section, we collected metadata du-
13              replacedRow ← rand. row in curr. sample;         ring the initial sampling step about the distribution of the
14              replace row with new row;                        data. Thus, all future samples which are used to populate
15              if replacedRow is rare then                      the index structure can be chosen so that they reflect the
 16                 resample replaced row in separate            distribution of the original data using a stratified sampling
                     sample                                      [5] approach. Data samples are thus pulled randomly from
17              end                                              their respective groups and not just randomly overall. Once
18          else                                                 an (approximate) query is issued, the progressive indexing
19              if row is a rare entry then                      is paused, and the query is only answered by tuples already
 20                 resample row in separate sample              indexed. If an extreme skew is present, it can easily happen
21              end                                              that samples would include less than one entry to reflect the
22          end                                                  correct distribution. We opt to include at least one of the-
23     end                                                       se very rare datapoints with each sample as illustrated in
24     rareThreshold ← rowCount * θ;                             Figure 5, since outliers often are a point of interest for ex-
25 end
                                                                 ploratory data analysis. The distribution in the index is thus
                                                                 slightly off in the beginning, but since the implementation
                                                                 keeps pulling samples continuously, this error will correct
                                                                 itself quickly over time. This also guarantees that enough
3.2    Early Query Executions                                    rare entries are available early to create meaningful appro-
   Early queries encompass all queries that are executed be-     ximations. When the samples are generated, they are pul-
fore any index is built. The set of tuples these queries are     led and indexed sequentially until the data is exhausted. At
executed on is a combination of the existing random uni-         this point, the index is fully built and all future queries will
form reservoir sample and parts of the random samples for        run with the full index. Note, that while later approxima-
rare populations. If the tuples of the reservoir sample con-     te queries take longer due to the ever increasing number of
tain too few of any rare attribute value, tuples contained       stratified tuples entered into the index, the gain in result
in the separate index of rare populations will be randomly       accuracy is usually worth it. While Holanda’s original ap-
added so that they are not overlooked. This will produce         proach only refines the coarse index into a B+ tree after all
more reliable results than straight reservoir sampling, even     data has been indexed, we chose to refine the index immedia-
if they are somewhat biased. Mind this mode of querying is       tely after indexing a new fragment of the original data. The
only temporary, as it is only included to bridge the time nee-   small overhead of immediate refinement to a full B+ tree
ded to preprocess the data set and create the initial index      allows for much faster queries than on a coarse index. The-
structure which incorporates a stratified sampling approach      re queries would quickly reach undesirable delays if issued
to take into account data skew and rare populations.             over large indexes. If the index becomes too large that que-
                                                                 rying it completely becomes too costly, we can resort either
3.3    Index Generation                                          to querying only a statistically relevant sample of the index
  After having created an initial random sample and a full       or to pruning the index to a subset of the original data. To
pass over the data to collect distributional metadata, a par-    keep these approximate queries or pruned indexes statisti-
Figure 6: Distribution of the age value of the test data set.    Figure 7: Distribution of the salary value of the test data set.

cally relevant, we reference the distributional metadata to
create an appropriate approximation of the original data. It
should be obvious, that at this point, the index structure is
not only useful for approximate queries, but will also serve
in scenarios where exact results are required.

4.    EVALUATION
  Since it is not really interesting if the queries themselves
are faster for an approximate setting instead of a contempo-
rary setting, we will focus on comparing the time until the
progressive indexing converges to a full query compared to
the approach of Holanda et al., and how well the stratified
index manages to account for data skew while uncompleted.

4.1   Setup
  For the evaluation, we have implemented the system in
Python version 3.8.3. The experiments were run on a ma-
chine with an Intel 6700k processor at 4.4GHz and 32GB of        Figure 8: Duration for the B+ index reaching full convergence
DDR4 RAM at 2133MHz.                                             while repeatedly issuing the simple query (age ≥ 32).

4.2   Data
  We have generated a dataset with five million rows contai-     seconds to a fully converted index, while our approach only
ning data about the age, salary and received tips of waiters.    takes 18 seconds. We thus reach index convergence roughly
To test the indexing prototype, we made sure to integrate        22 times faster than the original approach. This is mostly
different value distributions and amount of skew into the        due to the fact that the original approach relies on queries
data. The tip tuples are uniformly distributed. The distri-      always ranging over the complete data base to further gene-
bution of the age and salary values (see Figures 6 and 7) is     rate the index. As a result, the index cannot be constructed
each skewed towards a particular value range.                    further until the current query has finished processing all
                                                                 tuples. Our solution on the other hand utilizes the much
4.3   Time until Index Convergence                               shorter time to return of approximate queries to be able to
   For this experiment, we implemented progressive indexing      quickly construct an index.
in a similar fashion as Holanda et al. demonstrated in their
approach (specifically the Radix MSD variation in its base       4.4    Accounting for Data Skew
form) and ran it on the same data set. Only one index on the        In this experiment, we looked at how well our sampling ap-
‘age’ column was constructed. Each second we repeatedly is-      proach can handle data skew. For this, we ran a count(’age’)
sued a simple query (age ≥ 32) until the index converged.        query run once with an initial reservoir sample generated in
We did not implement the adaptive indexing budget pro-           the first step before a stratified index is built, and after-
posed by Holanda et al., but instead set the to-be-indexed       wards on a reservoir sample including rare subpopulation
fraction with each query to 10%. The sample size for our         sampling. The sample size for this test was 5000. Figure 9
version was again set to 5000. In Figure 8 we can see that       shows the pitfall of standard reservoir sampling. Since the
a progressive indexing approach greatly benefits from ap-        values 25 and 26 represent the bulk of the entries, other ent-
proximate querying techniques. The base version takes 404        ries are easily missed. Especially the rare outliers are excee-
                                                                   get was used to automatically adjust how much of the query
                                                                   time should be used for indexing. We could imagine a simi-
                                                                   lar solution for automatically determining the sample size for
                                                                   the progressive index in an approximate query environment.
                                                                   For the stratified sampling, we chose a very simple solution
                                                                   to determine how the samples should be generated by just
                                                                   relying on the skew of one column. In a real environment,
                                                                   this might not be viable since other columns might be just
                                                                   as interesting. Creating a more refined scheme to account
                                                                   for skew found in several attributes would further increase
                                                                   the reliability of the partial results returned by approximate
                                                                   queries with our proposed indexing strategy.
Figure 9: Distribution of a sample of salary values taken with
reservoir sampling.
                                                                   6.   REFERENCES
                                                                    [1] S. Acharya, P. B. Gibbons, and V. Poosala.
                                                                        Congressional samples for approximate answering of
                                                                        group-by queries. In W. Chen, J. F. Naughton, and
                                                                        P. A. Bernstein, editors, Proceedings of the 2000 ACM
                                                                        SIGMOD International Conference on Management of
                                                                        Data, May 16-18, 2000, Dallas, Texas, USA, pages
                                                                        487–498. ACM, 2000.
                                                                    [2] S. Agarwal, A. Panda, B. Mozafari, S. Madden, and
                                                                        I. Stoica. Blinkdb: Queries with bounded errors and
                                                                        bounded response times on very large data. CoRR,
                                                                        abs/1203.5485, 2012.
                                                                    [3] M. Angelini, T. May, G. Santucci, and H.-J. Schulz.
Figure 10: Distribution of a sample of salary values taken with         On quality indicators for progressive visual analytics.
stratified sampling.                                                    06 2019.
                                                                    [4] M. Angelini, G. Santucci, H. Schumann, and
                                                                        H. Schulz. A review and characterization of
dingly unlikely to be included in the initial sample. Figure            progressive visual analytics. Informatics, 5(3):31, 2018.
10 shows the impact rare population sampling strategy can           [5] M. P. Cohen. Stratified sampling. In M. Lovric, editor,
have on a sample of the same size. Even exceedingly rare                International Encyclopedia of Statistical Science,
outliers (age 30-36) are included in the sample. While not              pages 1547–1550. Springer, 2011.
correct, this sample composition is a much better represen-         [6] K. Dursun, C. Binnig, U. Çetintemel, and T. Kraska.
tation of the real data set. Here we can see the advantage              Revisiting reuse in main memory database systems.
of including rare population sampling in addition to reser-             CoRR, abs/1608.05678, 2016.
voir sampling until we have a true stratified sample. When          [7] J. Fekete, D. Fisher, A. Nandi, and M. Sedlmair.
the stratified index structure is built later, data skew largely        Progressive data analysis and visualization (dagstuhl
stops being a problem as we now know the distribution of                seminar 18411). Dagstuhl Reports, 8(10):1–40, 2018.
values. As shown in Figure 5, the distribution is not exact         [8] J. Fekete and R. Primet. Progressive analytics: A
simply because we can not include fractions of tuples, but              computation paradigm for exploratory data analysis.
good enough to give a quite accurate picture of the complete            CoRR, abs/1607.05162, 2016.
data.                                                               [9] A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, and
                                                                        T. Kraska. Revisiting reuse for approximate query
5.   CONCLUSION AND OUTLOOK                                             processing. PVLDB, 10(10):1142–1153, 2017.
   For this work we prototyped a progressive indexing ap-          [10] A. Gogolou, T. Tsandilas, T. Palpanas, and
proach usable in an approximate querying environment in-                A. Bezerianos. Progressive similarity search on time
spired by a Progressive Indexing approach by Holanda et                 series data. In Proceedings of the Workshops of the
al. [13]. The AQP environment leads to significantly lower              EDBT/ICDT 2019 Joint Conference, EDBT/ICDT
query execution times, even when the data is not fully in-              2019, Lisbon, Portugal, March 26, 2019, 2019.
dexed. To make these approximate results more significant,         [11] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber,
we combined reservoir sampling, meta data collection (rare              C. Olston, V. Raman, T. Roth, and P. J. Haas.
population sampling) and a stratified sampling approach to              Interactive data analysis: The control project.
remedy for the intrinsic inaccuracy of approximate queries.             Computer, 32(8):51–59, 1999.
Our evaluation results confirm Holanda’s assumption that           [12] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online
the progressive indexing approach benefits greatly from the             aggregation. In J. Peckham, editor, SIGMOD 1997,
much faster time to return of approximate queries on non-               Proceedings ACM SIGMOD International Conference
indexed data. For future work on this topic, we identified two          on Management of Data, May 13-15, 1997, Tucson,
major areas of potential improvements: Including an inde-               Arizona, USA, pages 171–182. ACM Press, 1997.
xing budget, and expanding the stratified sampling strategy.       [13] P. Holanda, S. Manegold, H. Mühleisen, and
In the original approach by Holanda et al., an indexing bud-            M. Raasveldt. Progressive indexes: Indexing for
     interactive data analysis. PVLDB, 12(13):2366–2378,      [17] C. D. Stolper, A. Perer, and D. Gotz. Progressive
     2019.                                                         visual analytics: User-driven visual exploration of
[14] C. M. Jermaine, S. Arumugam, A. Pol, and A. Dobra.            in-progress analytics. IEEE Transactions on
     Scalable approximate query processing with the DBO            Visualization and Computer Graphics,
     engine. In C. Y. Chan, B. C. Ooi, and A. Zhou,                20(12):1653–1662, 2014.
     editors, Proceedings of the ACM SIGMOD                   [18] C. Turkay, N. Pezzotti, C. Binnig, H. Strobelt,
     International Conference on Management of Data,               B. Hammer, D. A. Keim, J. Fekete, T. Palpanas,
     Beijing, China, June 12-14, 2007, pages 725–736.              Y. Wang, and F. Rusu. Progressive data science:
     ACM, 2007.                                                    Potential and challenges. CoRR, abs/1812.08032, 2018.
[15] Z. Liu and J. Heer. The effects of interactive latency   [19] M. van den Berg. Applying progressive indexing in the
     on exploratory visual analysis. IEEE Trans. Vis.              context of approximate query processing. 2021.
     Comput. Graph., 20(12):2122–2131, 2014.                  [20] J. S. Vitter. Random sampling with a reservoir. ACM
[16] T. K. Sellis. Review - the aqua approximate query             Trans. Math. Softw., 11(1):37–57, 1985.
     answering system. ACM SIGMOD Digit. Rev., 2, 2000.