=Paper=
{{Paper
|id=Vol-3075/paper5
|storemode=property
|title=Progressive Indexing for Interactive Analytics
|pdfUrl=https://ceur-ws.org/Vol-3075/paper5.pdf
|volume=Vol-3075
|authors=Michael Hohenstein
|dblpUrl=https://dblp.org/rec/conf/gvd/Hohenstein21
}}
==Progressive Indexing for Interactive Analytics==
Progressive Indexing for Interactive Analytics Michael Hohenstein Technical University of Kaiserslautern Erwin-Schrödinger-Straße, 67663 Kaiserslautern Kaiserslautern, Germany hohenstein@cs.uni-kl.de ABSTRACT algorithms and have investigated potentials and future chal- Progressive Visual Analytics (PVA) is a recent paradigm in lenges of the paradigm. the realm of visualization. PVA is closely related to appro- Turkay et al. have discussed potentials and challenges of a ximate query processing and streaming applications with a progressive data science approach in the field of Machine focus on real-time interactivity and features supporting a Learning, Data Base Systems and Visualization [18]. They fluid and insightful experience for data analysts working on identify the following relevant areas of (recent) research in unknown data sources in an exploratory way. We want to ob- the DB community: serve the paradigm of progressive data science through the • Approximate Query Processing lens of database systems, more specifically considering exis- ting technologies which can be exploited to support progres- • re-use of (partial) results and sampling that takes into sive systems. In this paper we briefly investigate the simila- account rare subpopulations rities of PVA to approximate query processing and present an indexing strategy which can be utilized in approxima- • self-adapting (data organization, indexing, etc.) data te and progressive environments without conflicting with its structures intrinsic requirements. • speculative query processing Keywords • (progressive) data wrangling and cleaning progressive visual analytics, approximate query processing, In this paper, we want to take a more detailed look at indexing, sampling the application of progressive indexing [13] in the context of PVA. In particular, we will focus on a self-adapting index, touching the areas of approximate querying and sampling 1. INTRODUCTION to support the requirements of progressive analysis tasks. Progressive Visual Analytics (PVA) [4, 17] is a relatively The main contributions of the paper is to demonstrate the new paradigm in the realm of visualization. It’s main objec- synergy between approximate (and by similarity of the con- tive is to develop algorithms and infrastructure to support cepts progressive) querying approaches and progressive in- analysts in exploratory ad hoc data analysis. This means dexing. Additionally, we present a sampling and indexing each query should return an (approximate) result within an strategy useful in realizing systems for progressive query en- upper time bound, so that data exploration can be consi- vironments. The outline of our paper is as follows: First we dered a real-time process. Additionally, the analyst should want to give an overview about the paradigm of PVA and be able to steer the query by tuning parameters of the com- take a look at which existing data base technologies can be putation. The key principle of the progressive paradigm is utilized to support the progressive paradigm. In sectiion 3 to instantly return an approximate result which is (progres- we present our approach of combining a strategy involving sively) updated in the background. Ideally some notion of approximate query processing and construction of a progres- (partial) re-use of earlier results that intersect with a live sive index to support aspects of ”progressiveness”regarding query should be in place to keep a certain continuity in a fast, reliable intermediate results. We then present an eva- computation and further reduce the response time of later luation of our efforts and in section 4. We close the paper queries. Many recent papers on PVA ([3, 7, 8, 10]) have with a conclusion and outlook on future work. put the focus on features and requirements of progressive 2. RELATED WORK AND BACKGROUND The iterative nature of the progressive computation can be found in a number of already existing computational con- cepts. The most similar concepts that come to mind are Streaming, Online, and Iterative algorithms. In the realm of data base applications, the concept of Online Aggregation is 32nd GI-Workshop on Foundations of Databases (Grundlagen von Daten- closely related [9, 12]. Essentially, the field of PVA does not Copyright © banken), September 01-03, 2021, Munich, Germany. 2021 for this paper by its authors. Use permitted under Crea- tive Commons License Attribution 4.0 International (CC BY 4.0). introduce a completely new computational paradigm, but describes a set of features a computational process should fulfill to to be classified as progressive, Stolper, Aupetit, and Fekete et al. [7] comprised a definition of features a progres- sive computation must possess. The computation should: 1. Be bounded on time and data. 2. Report intermediate outputs: a result, and measures of quality and progress. 3. Be bounded on latency. Figure 1: Original Indexing Approach of Holanda et al. The entries are all tuples which will be returned for a running 4. Converge towards the true result. query. [13] 5. Be controllable by a user during execution. From the perspective of data base systems, this leads to On a basic level, the indexer by Holanda et al. takes a frac- the following ”progressive”requirements for a system or in- tion δ of the data of and adds them to a growing index (see frastructure designed for analytical components that adhere Figure 1). The cost of index creation is ßmeared outöver the to the design goals mentioned above. run time of queries issued to the data base. For a standard query environment, this means that initially queries will ta- 1. Suitable sampling algorithms with outlier detection. ke rather long, as early inquiries to the data base can not benefit from indexes and have to scan all tuples sequentially. 2. Fast approximate algorithms. 3. Efficient (partial) reuse to bring continuity into a stee- More precisely, Holanda’s approach is divided into three red computation. phases. In the Creation Phase a (standard, non approxima- te) query first performs an index lookup on the fraction of 4. Metrics on quality and progression of the computation. indexed data (ρ). Now, the not-yet-indexed 1 − ρ fraction of the original column is scanned while expanding the index by 5. Guarantees on user given time-bounds to the first re- a chosen fraction δ of the column size. For the index creati- sult and to new results. on Holanda et al. use one of four sorting algorithms: Quick- sort, Bucketsort, Radixsort (MSD), and Radixsort (LSD). 6. Supporting structures to aid data analysts (Indexes, Depending on the utilized Indexing method, the partial in- Meta Data, etc.). dex performs better for different querying scenarios. Once In the context of this paper, we focus on requirements all data is indexed the approach enters the Refinement Pha- 1, 2, and 6: Sampling with outlier detection, approximate se. There, queries can be performed without scanning any querying approaches, and indexing techniques to support non-indexed data. Additionally, the existing, rough index is progressive exploration tasks. More specifically, we propo- refined, progressively converging towards a fully ordered in- se a progressively built index which is constructed while a dex. Lastly, in the Consolidation Phase the ordered index user poses fast approximate queries in a real time fashion. is now progressively converted to a B+ tree for query effi- By real time we mean, that queries should return a result ciency. For an example, please refer to Figure 2 for a short below the interactivity threshold of 500ms, as proposed by overview of the quicksort indexing appproach of Holanda et Liu et al [15]. For our approach, we investigated the work al. of Holanda et al. [13], which describes an index which is While useful in an exact, non approximate querying set- built progressively while querying a data base, and aim to ting, Holanda states that progressive indexing techniques integrate this indexing technique with Approximative Query most likely will synergize well with approximate querying Processing (AQP). Generally, there are two major categories techniques, since the problem of initially longer running que- of AQP approaches. online aggregation [12] and sampling [1]. ries can be remedied by settling for faster, approximate re- Sampling approaches (e.g. BlinkDB [2], AQUA [16]) requi- sults before an index structure exists. re preprocessing to deliver statistically significant results on lower populations, which clashes with the dynamic nature 3. PROGRESSIVE INDEXING IN THE CON- of exploratory data analysis. Online aggregation (e.g. CON- TEXT OF APPROXIMATE QUERY PRO- TROL [11], DBO [14]) struggles to give reliable approxima- tions for rare subpopulations of a large data set. To remedy CESSING. these weaknesses, we studied the work of Galakatos et al. The premise of progressive indexing is to iteratively crea- [9] who employs uniform sampling in tandem with rare sub- te a database index on a previously unknown data sour- population detection to amount for data skew and outliers. ce and ßmear out”the cost of doing so over several running We use a similar approach, but combine it with parallel pre- queries. We want to explore if progressive indexing can be processing to collect distributional metadata of the data set used alongside and/or benefit from a progressive querying to be explored in order to improve later samples. We bridge approach. Since building a complete index can take a long the non-interactive preprocessing gap with uniform samp- time, doing so is kind of antithetical for an ad-hoc explora- ling (which requires no initial preprocessing), refined with tory query session where fast (approximate) results on an rare subpopulation detection. As such, we are able to quick- unexplored and unprepared data source are the focus. The ly answer queries with rare populations and data skew taken progressive indexing will incrementally build an index in par- into account, while samples taken later accurately represent allel to interactive query sessions, speeding up certain que- the real data. ries, and also preparing the data base for queries requiring an Figure 3: Flowchart of the interaction between query engine Figure 2: Example of progressive Indexing using Quicksort and sampler. showing the Creation, Refinement, and Consolidation phases. [13] exact result. The synergy with approximate querying largely stems from the fact, that approximate / progressive queries don’t have to scan the complete data base for a query which is fired against a data base with no index whatsoever. If only a subset of tuples is retrieved initially, the indexing can be done in a more dynamic manner alongside of a fluid ”real ti- meüser interaction from the start. To explore this venue, we built a prototype comparing the suitability of various samp- Figure 4: Modified Indexing Approach of Holanda et al. To ling / indexing techniques based on Holanda’s work in the account for Approximate Querying Techniques. scope ([19]) of approximate querying applications. Rough- ly explained, our approach first enters a preparatory stage in which facilities for the progressive index are created and a subset of the complete data set and does not have to scan meta-data to ensure statistically significant samples is col- the complete data set. lected. After this initial stage the partial index is in place Approximate results require a random sample to be sta- and populated in parallel to approximate querying sessions. tistically relevant. Taking just the first δ rows as a sample To keep the system interactive, during the preparatory sta- like in the original implementation would not return good ge queries can be answered in parallel, albeit not with data results. As Galakatos et al. point out [6], we require some skew or outliers taken into account. initial knowledge about a data set to be able to reach some The following features are our main contributions which modicum of statistical significance with standard sampling differentiate our approach to the work of Holanda et al. First, methods (like the total row count for example). To solve we keep an absolute number s of tuples instead of a certain this dilemma, we utilize Reservoir Sampling [20] to create fraction δ of the complete data set to determine how many the first sample. Since reservoir sampling misses any skew tuples to index, since we have no clue how many rows the or outliers in the data, we combine the Reservoir Sampling data set to be inspected has. Also, not only do we partially technique with preprocessing of the data to be streamed to index the complete set per query, we also employ approxima- make the following samples more representative of the real te query processing and trade accuracy of results for speed data set. We count the number of rows of the data set as and concerning the answering of queries during all stages of que- record the count of attribute groups to get meta data about rying. In the same vein, we supplement our sampling approa- a columns distribution and the replacement probability for ches with rare population indexing to improve the statistical each row, as shown in Algorithm 1. While a preprocessing significance of samples taken from the original data. step might seem to go against to the requirements of the For a high-level overview of the architecture of the query progressive paradigm, we deem it acceptable as queries can and sampling engine, please refer to Figure 3. already be answered during the preprocessing, albeit with less accurate results. 3.1 Initial Sampling and Preprocessing Since in an approximate environment the cardinality of 3.1.1 Rare Population Indexing the complete data base is unknown, it is impossible to just Together with the reservoir sample, we build another sam- determine some fraction δ of the complete data set which ple that takes rare attribute values into account. To do this, should be indexed per query. Since for our case we always we designate a certain fraction θ as a threshold for a value take (increasing) samples of the complete data, we index s to count as rare. For this we use the current row count and many tuples, where s is the size of the sample (See Figure 4). attribute distribution gained during the reservoir preproces- A nice side effect of this approach is, that an initial partial sing. Any attribute appearing for a fraction of less than θ index is available much earlier, as the query only processes of the row count will be sampled separately. Since the row count increases during iteration, we have to continuously check if we still consider an attribute rare. Once an attribu- te is deemed rare, it will be removed from the rare sample pool if it appears more often than (θ ∗2)∗current rowcount. Using this method will provide sufficient entries in the sam- ple for rare groups and thus more interesting results, even in the early phases of sampling. The rare samples however will be discarded when the system switches to static sampling to guarantee the correct distribution and proper randomness (See Algorithm 1). Algorithm 1: Modified Reservoir Sampling Input: Data stream to record distribution from. Output: Random uniform sample of dataset, random Figure 5: Creation of the Adjusted Stratified Sample based samples from rare populations. on the distribution of the age column 1 rowcount ← 0; 2 rareThreshold ← 0; 3 groupCounts ← initialize dictionary; 4 sample ← initialize sample; tial index tree is created. For this we have implemented the 5 for each row in datastream do Progressive Radixsort (MSD) method, similar to Holanda ¯ ¯ et al. [13]. From the initial preprocessing, we create a B+ 6 groupCounts ← occurenec of att. value; 7 rowCount += 1; tree index structure for each column in accordance of the 8 if length of sample < max. sample size then amount of different attribute instances we observed. Into 9 add row to sample; this index tree the original reservoir sample is then inserted. 10 else Now, this index tree will bee populated with more samples 11 replacement prob. ← sample size / rowcount; over time, utilizing any down-time during a user session. As 12 if value gets replaced then mentioned earlier in this section, we collected metadata du- 13 replacedRow ← rand. row in curr. sample; ring the initial sampling step about the distribution of the 14 replace row with new row; data. Thus, all future samples which are used to populate 15 if replacedRow is rare then the index structure can be chosen so that they reflect the 16 resample replaced row in separate distribution of the original data using a stratified sampling sample [5] approach. Data samples are thus pulled randomly from 17 end their respective groups and not just randomly overall. Once 18 else an (approximate) query is issued, the progressive indexing 19 if row is a rare entry then is paused, and the query is only answered by tuples already 20 resample row in separate sample indexed. If an extreme skew is present, it can easily happen 21 end that samples would include less than one entry to reflect the 22 end correct distribution. We opt to include at least one of the- 23 end se very rare datapoints with each sample as illustrated in 24 rareThreshold ← rowCount * θ; Figure 5, since outliers often are a point of interest for ex- 25 end ploratory data analysis. The distribution in the index is thus slightly off in the beginning, but since the implementation keeps pulling samples continuously, this error will correct itself quickly over time. This also guarantees that enough 3.2 Early Query Executions rare entries are available early to create meaningful appro- Early queries encompass all queries that are executed be- ximations. When the samples are generated, they are pul- fore any index is built. The set of tuples these queries are led and indexed sequentially until the data is exhausted. At executed on is a combination of the existing random uni- this point, the index is fully built and all future queries will form reservoir sample and parts of the random samples for run with the full index. Note, that while later approxima- rare populations. If the tuples of the reservoir sample con- te queries take longer due to the ever increasing number of tain too few of any rare attribute value, tuples contained stratified tuples entered into the index, the gain in result in the separate index of rare populations will be randomly accuracy is usually worth it. While Holanda’s original ap- added so that they are not overlooked. This will produce proach only refines the coarse index into a B+ tree after all more reliable results than straight reservoir sampling, even data has been indexed, we chose to refine the index immedia- if they are somewhat biased. Mind this mode of querying is tely after indexing a new fragment of the original data. The only temporary, as it is only included to bridge the time nee- small overhead of immediate refinement to a full B+ tree ded to preprocess the data set and create the initial index allows for much faster queries than on a coarse index. The- structure which incorporates a stratified sampling approach re queries would quickly reach undesirable delays if issued to take into account data skew and rare populations. over large indexes. If the index becomes too large that que- rying it completely becomes too costly, we can resort either 3.3 Index Generation to querying only a statistically relevant sample of the index After having created an initial random sample and a full or to pruning the index to a subset of the original data. To pass over the data to collect distributional metadata, a par- keep these approximate queries or pruned indexes statisti- Figure 6: Distribution of the age value of the test data set. Figure 7: Distribution of the salary value of the test data set. cally relevant, we reference the distributional metadata to create an appropriate approximation of the original data. It should be obvious, that at this point, the index structure is not only useful for approximate queries, but will also serve in scenarios where exact results are required. 4. EVALUATION Since it is not really interesting if the queries themselves are faster for an approximate setting instead of a contempo- rary setting, we will focus on comparing the time until the progressive indexing converges to a full query compared to the approach of Holanda et al., and how well the stratified index manages to account for data skew while uncompleted. 4.1 Setup For the evaluation, we have implemented the system in Python version 3.8.3. The experiments were run on a ma- chine with an Intel 6700k processor at 4.4GHz and 32GB of Figure 8: Duration for the B+ index reaching full convergence DDR4 RAM at 2133MHz. while repeatedly issuing the simple query (age ≥ 32). 4.2 Data We have generated a dataset with five million rows contai- seconds to a fully converted index, while our approach only ning data about the age, salary and received tips of waiters. takes 18 seconds. We thus reach index convergence roughly To test the indexing prototype, we made sure to integrate 22 times faster than the original approach. This is mostly different value distributions and amount of skew into the due to the fact that the original approach relies on queries data. The tip tuples are uniformly distributed. The distri- always ranging over the complete data base to further gene- bution of the age and salary values (see Figures 6 and 7) is rate the index. As a result, the index cannot be constructed each skewed towards a particular value range. further until the current query has finished processing all tuples. Our solution on the other hand utilizes the much 4.3 Time until Index Convergence shorter time to return of approximate queries to be able to For this experiment, we implemented progressive indexing quickly construct an index. in a similar fashion as Holanda et al. demonstrated in their approach (specifically the Radix MSD variation in its base 4.4 Accounting for Data Skew form) and ran it on the same data set. Only one index on the In this experiment, we looked at how well our sampling ap- ‘age’ column was constructed. Each second we repeatedly is- proach can handle data skew. For this, we ran a count(’age’) sued a simple query (age ≥ 32) until the index converged. query run once with an initial reservoir sample generated in We did not implement the adaptive indexing budget pro- the first step before a stratified index is built, and after- posed by Holanda et al., but instead set the to-be-indexed wards on a reservoir sample including rare subpopulation fraction with each query to 10%. The sample size for our sampling. The sample size for this test was 5000. Figure 9 version was again set to 5000. In Figure 8 we can see that shows the pitfall of standard reservoir sampling. Since the a progressive indexing approach greatly benefits from ap- values 25 and 26 represent the bulk of the entries, other ent- proximate querying techniques. The base version takes 404 ries are easily missed. Especially the rare outliers are excee- get was used to automatically adjust how much of the query time should be used for indexing. We could imagine a simi- lar solution for automatically determining the sample size for the progressive index in an approximate query environment. For the stratified sampling, we chose a very simple solution to determine how the samples should be generated by just relying on the skew of one column. In a real environment, this might not be viable since other columns might be just as interesting. Creating a more refined scheme to account for skew found in several attributes would further increase the reliability of the partial results returned by approximate queries with our proposed indexing strategy. Figure 9: Distribution of a sample of salary values taken with reservoir sampling. 6. REFERENCES [1] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 487–498. ACM, 2000. [2] S. Agarwal, A. Panda, B. Mozafari, S. Madden, and I. Stoica. Blinkdb: Queries with bounded errors and bounded response times on very large data. CoRR, abs/1203.5485, 2012. [3] M. Angelini, T. May, G. Santucci, and H.-J. Schulz. Figure 10: Distribution of a sample of salary values taken with On quality indicators for progressive visual analytics. stratified sampling. 06 2019. [4] M. Angelini, G. Santucci, H. Schumann, and H. Schulz. A review and characterization of dingly unlikely to be included in the initial sample. Figure progressive visual analytics. Informatics, 5(3):31, 2018. 10 shows the impact rare population sampling strategy can [5] M. P. Cohen. Stratified sampling. In M. Lovric, editor, have on a sample of the same size. Even exceedingly rare International Encyclopedia of Statistical Science, outliers (age 30-36) are included in the sample. While not pages 1547–1550. Springer, 2011. correct, this sample composition is a much better represen- [6] K. Dursun, C. Binnig, U. Çetintemel, and T. Kraska. tation of the real data set. Here we can see the advantage Revisiting reuse in main memory database systems. of including rare population sampling in addition to reser- CoRR, abs/1608.05678, 2016. voir sampling until we have a true stratified sample. When [7] J. Fekete, D. Fisher, A. Nandi, and M. Sedlmair. the stratified index structure is built later, data skew largely Progressive data analysis and visualization (dagstuhl stops being a problem as we now know the distribution of seminar 18411). Dagstuhl Reports, 8(10):1–40, 2018. values. As shown in Figure 5, the distribution is not exact [8] J. Fekete and R. Primet. Progressive analytics: A simply because we can not include fractions of tuples, but computation paradigm for exploratory data analysis. good enough to give a quite accurate picture of the complete CoRR, abs/1607.05162, 2016. data. [9] A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, and T. Kraska. Revisiting reuse for approximate query 5. CONCLUSION AND OUTLOOK processing. PVLDB, 10(10):1142–1153, 2017. For this work we prototyped a progressive indexing ap- [10] A. Gogolou, T. Tsandilas, T. Palpanas, and proach usable in an approximate querying environment in- A. Bezerianos. Progressive similarity search on time spired by a Progressive Indexing approach by Holanda et series data. In Proceedings of the Workshops of the al. [13]. The AQP environment leads to significantly lower EDBT/ICDT 2019 Joint Conference, EDBT/ICDT query execution times, even when the data is not fully in- 2019, Lisbon, Portugal, March 26, 2019, 2019. dexed. To make these approximate results more significant, [11] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber, we combined reservoir sampling, meta data collection (rare C. Olston, V. Raman, T. Roth, and P. J. Haas. population sampling) and a stratified sampling approach to Interactive data analysis: The control project. remedy for the intrinsic inaccuracy of approximate queries. Computer, 32(8):51–59, 1999. Our evaluation results confirm Holanda’s assumption that [12] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online the progressive indexing approach benefits greatly from the aggregation. In J. Peckham, editor, SIGMOD 1997, much faster time to return of approximate queries on non- Proceedings ACM SIGMOD International Conference indexed data. For future work on this topic, we identified two on Management of Data, May 13-15, 1997, Tucson, major areas of potential improvements: Including an inde- Arizona, USA, pages 171–182. ACM Press, 1997. xing budget, and expanding the stratified sampling strategy. [13] P. Holanda, S. Manegold, H. Mühleisen, and In the original approach by Holanda et al., an indexing bud- M. Raasveldt. Progressive indexes: Indexing for interactive data analysis. PVLDB, 12(13):2366–2378, [17] C. D. Stolper, A. Perer, and D. Gotz. Progressive 2019. visual analytics: User-driven visual exploration of [14] C. M. Jermaine, S. Arumugam, A. Pol, and A. Dobra. in-progress analytics. IEEE Transactions on Scalable approximate query processing with the DBO Visualization and Computer Graphics, engine. In C. Y. Chan, B. C. Ooi, and A. Zhou, 20(12):1653–1662, 2014. editors, Proceedings of the ACM SIGMOD [18] C. Turkay, N. Pezzotti, C. Binnig, H. Strobelt, International Conference on Management of Data, B. Hammer, D. A. Keim, J. Fekete, T. Palpanas, Beijing, China, June 12-14, 2007, pages 725–736. Y. Wang, and F. Rusu. Progressive data science: ACM, 2007. Potential and challenges. CoRR, abs/1812.08032, 2018. [15] Z. Liu and J. Heer. The effects of interactive latency [19] M. van den Berg. Applying progressive indexing in the on exploratory visual analysis. IEEE Trans. Vis. context of approximate query processing. 2021. Comput. Graph., 20(12):2122–2131, 2014. [20] J. S. Vitter. Random sampling with a reservoir. ACM [16] T. K. Sellis. Review - the aqua approximate query Trans. Math. Softw., 11(1):37–57, 1985. answering system. ACM SIGMOD Digit. Rev., 2, 2000.