ViewSeeker: An Interactive View Recommendation Tool

Xiaozhong Zhang (University of Pittsburgh, xiaozhong@pitt.edu), Xiaoyu Ge (University of Pittsburgh, xiaoyu@cs.pitt.edu), Panos K. Chrysanthis (University of Pittsburgh, panos@cs.pitt.edu), Mohamed A. Sharaf (University of Queensland, m.sharaf@uq.edu.au)

ABSTRACT

View recommendation has emerged as a powerful tool to assist data analysts in exploring and understanding big data. Existing view recommendation approaches have proposed a variety of utility functions for selecting useful views. Even though each utility function might be suitable for specific scenarios, identifying the most appropriate ones, along with their tunable parameters, which represent the user's intention during an exploration, is a challenge for both expert and non-expert users. This paper presents the first attempt towards interactive view recommendation by automatically discovering the most appropriate view utility functions in an exploration based on the user's preferences. In particular, our proposed ViewSeeker uses a novel active learning method to discover the view utility function by interactively refining the set of k view recommendations. The effectiveness and efficiency of ViewSeeker were experimentally evaluated using both synthetic and real data sets.

© 2019 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org.

Figure 1: Example of a view providing an insight about the performance of an NBA team.

1 INTRODUCTION

The ubiquitously available information sources and the advancements in data storage and acquisition techniques have led to an aggressive increase in the data volumes available for data analysis tasks. One major challenge in utilizing these abundantly available data is discovering insights from them effectively and efficiently. Examples of an "insight" include structure, patterns, and causal relationships. To explore these massive and structurally complicated datasets, data analysts often utilize visual data analysis tools such as Tableau [2] and Voyager [28]. However, the effectiveness of these tools depends on the user's expertise and experience. Coming up with a visualization that shows interesting trends/patterns is a non-trivial task. Consider, for example, a view comparing the 3-point attempt rate of a selected NBA team with that of all teams in the league (Figure 1), which could explain why the selected team on the right (black) outperformed the league average on the left (gray) and won a championship. The analyst needs to examine the relationships among various attributes and consider various aggregate functions before any useful visualizations can be discovered. This approach is typically ad hoc, labor-intensive, and not scalable, especially for high-dimensional databases.

To address the shortcomings of current visual analysis tools, several methods for recommending visualizations have recently been proposed (e.g., [5, 6, 11, 12, 16, 18, 27]). These methods automatically generate all possible views of the data and recommend the top-k interesting views according to some utility function (e.g., deviation, data variance, usability) that measures the interestingness of the data. Even though each utility function might be suitable for specific scenarios, identifying the most appropriate ones and their tunable parameters, which represent the user's intention during an exploration, is still a challenge for both expert and non-expert users.

Motivated by the need for supporting visualization recommendation that is tailored to the user's exploration task, we propose an interactive view recommendation tool, called ViewSeeker, that efficiently assists the user in visually exploring large, high-dimensional datasets. The main idea of ViewSeeker is to automatically identify the most appropriate utility function based on the user's feedback on selected views. In particular, ViewSeeker iteratively presents a set of views to the user, and the user is asked to provide simple feedback on each of the presented views. Utilizing such feedback, ViewSeeker learns the most appropriate utility function by employing a novel active learning method. In each iteration, ViewSeeker effectively predicts and refines the view utility function and identifies those views, among all possible views, which have high values under their utility functions. ViewSeeker ensures a sub-second response time for each subsequent iteration by employing multiple optimization techniques such as pruning, sampling, and ranking.

To verify the effectiveness of our proposed interactive view recommendation tool, ViewSeeker, we implemented a prototype system and experimentally evaluated it using both a synthetic dataset and a real-world dataset of diabetic patients [1]. Our evaluation results have confirmed the effectiveness of ViewSeeker, such that on average only 7-16 user answers (feedback) are required to obtain a set of highly accurate top-k view recommendations.

Specifically, the contributions of this paper are the following:

(1) Designing an interactive view recommendation tool called ViewSeeker that iteratively discovers the most appropriate view utility function and refines the recommended views (i.e., histograms or bar charts) with the feedback that the user provided.

(2) Proposing a suite of optimization techniques that improve the efficiency of the system while minimizing the impact on the quality of the recommended views.
(3) Implementing a prototype system of the proposed ViewSeeker, and verifying the effectiveness of our proposed approach on both synthetic and real-world datasets.

Outline. The rest of the paper is structured as follows. Section 2 introduces our problem definitions. Section 3 presents our proposed approach. Sections 4 and 5 describe our experimental testbed and results. Section 6 discusses related work and Section 7 concludes the paper.

2 PROBLEM FORMULATION

In this section, we present the necessary background details of our work. Specifically, we discuss how views can be constructed through SQL queries, and then explain how the interestingness of a view may be captured through a pre-defined utility function. Afterward, we formally present our proposed problem.

2.1 Views & Data Visualization

To begin, we start by describing a view (i.e., histogram or bar chart) in the context of structural databases. A view vi essentially represents an SQL query with a group-by clause over a database D. Under the typical multi-dimensional data models, data can be modeled as a set of measure attributes M = {m1, m2, m3, ...} and a set of dimension attributes A = {a1, a2, a3, ...}. The measure attributes (e.g., number of items sold) are the set of attributes that contain measurable values and can be aggregated. The dimension attributes (e.g., brand, year, color, size) are the set of attributes on which measure attributes are viewed. To formulate an SQL query with a group-by clause, we also need a set of aggregation functions F = {f1, f2, f3, ...}. Thus, we can represent each view vi as a triple (a, m, f), such that one dimension attribute a is applied to one aggregation function f on the corresponding measure attribute m. Consequently, the View Space (VS), i.e., the total number of possible views, is:

VS = 2 × |A| × |M| × |F|    (1)

Clearly, VS can be large, especially with high-dimensional data. In order to recommend the set of k most interesting views from a large number of views, utility scores are required to rank all the views. To compute such utility scores, the existing literature has proposed a large number of utility functions; some commonly used ones include deviation [27], accuracy [5], usability [5], and p-value [26]. Furthermore, these utility functions also contain their own parameters, and any of these functions can further be combined linearly with others to form composite utility functions, thus leading to an enormous search space for the utility functions.

We use the family of deviation-based utility functions as an example to further explain the potential search space of the utility function and to illustrate how the interestingness of a view may be measured. For clarity, we call each original view a target view viT, which is represented as a triple (a, m, f) applied to a subset of the data DQ that is produced by a given user query Q. In order to define the deviation, we create a helper view called the reference view viR for each target view. The reference view visualizes the results of grouping the data in the whole database D with the same triple (a, m, f) used by the target view. An example of a target view with its reference view is illustrated in Figure 2.

Figure 2: A target view and its corresponding reference view.

Deviation measures the difference between the target view and the reference view, with the underlying assumption that the greater the distance between the target view and the reference view, the higher the utility is. In the case of histograms or bar charts, measuring the distance between two views vi and vj essentially equals measuring the distance between two normalized probability distributions P(vi) and P(vj). Formally, the utility score u(vi) of a view vi computed from deviation can be defined as:

u(vi) = DT(P(viT), P(viR))    (2)

where DT is the distance function that measures the distance between two distributions (e.g., Euclidean, Earth Mover's Distance).

Consequently, the typical view recommendation problem can be defined as follows:

Definition 1. (View Recommendation Problem) Given a database D, a user-specified query Q, a set of results R produced by Q, a utility function u(), and the size of the preferred view recommendations K, find the top-k views v1, v2, ..., vk constructed from R that have the highest utilities according to u() among all possible views.

2.2 Problem Settings

The above definition of a typical view recommendation problem assumes that the utility function is given [19, 21, 28, 29]. In this section, we formalize our problem of interactive view recommendation, in which the composition of the utility function is not given but is discovered.

Definition 2. (View Utility Function Selection Problem) Consider a d-dimensional database D. Further, consider a user-specified query Q, a set of results R produced by Q, a set of n possible utility functions U = {u1, u2, ..., un}, and the size of the preferred view recommendations K. Find the utility function up(), which produces the top-K views Vp = {v1p, v2p, ..., vkp} constructed from R based on the user's feedback, such that up() can be any arbitrary combination of the utility functions in U and most accurately captures the user's ideal utility function u*().
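To make the definitions of Section 2 concrete, below is a minimal sketch (not part of the ViewSeeker implementation; the toy schema and helper names are our own) of how the view space of Eq. 1 can be enumerated and how the deviation utility of Eq. 2 can be computed, here with an L2 (Euclidean) distance as one possible choice of DT:

```python
from itertools import product

# Hypothetical toy schema: dimension attributes A, measure attributes M,
# and aggregation functions F, as in Section 2.1.
A = ["brand", "year"]
M = ["sales", "profit"]
F = ["sum", "avg", "count"]

# Each (a, m, f) triple yields both a target view and a reference view,
# hence the factor of 2 in the View Space size of Eq. 1.
views = list(product(A, M, F))
view_space_size = 2 * len(A) * len(M) * len(F)

def normalize(bins):
    """Turn raw per-bin aggregates into a probability distribution."""
    total = sum(bins)
    return [g / total for g in bins]

def l2_distance(p, q):
    """One possible distance function DT for Eq. 2 (Euclidean)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def deviation_utility(target_bins, reference_bins):
    """Eq. 2: u(vi) = DT(P(viT), P(viR))."""
    return l2_distance(normalize(target_bins), normalize(reference_bins))
```

Other distances (e.g., Earth Mover's Distance) can be swapped in for `l2_distance` without changing the rest of the sketch.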
Clearly, view recommendation with dynamic view utility function selection is a more challenging problem compared to traditional view recommendation, given that it has a much higher search space complexity, because it combines the traditional View Space (Eq. 1) with the search space of the components of the utility functions:

UtilityVS = VS × |u1()| × · · · × |un()|    (3)

where ui() is a utility function, i = 1, ..., n.

That is, in the context of our problem, we are expanding the representation of a view vi from a triple (a, m, f) to a tuple (a, m, f, u1(), ..., un()), and central to our solution is discovering the ideal utility function u*() interactively through user feedback:

u*() = β1 u1() + · · · + βn un()    (4)

where βi is the weight assigned to the corresponding possible utility function ui, i = 1, ..., n. It can be seen from this equation that u*() can be any linear combination of the individual utility functions. For instance, u*() can be mapped to a single utility function ui, in which case βi = 1 and all other β are 0; or u*() can be a combination of multiple utility functions, where the set of corresponding β of the utility functions in u*() sums to 1 and the remaining β are set to 0.

Since our objective is to predict the user's most preferred utility function u*() with high precision, we can measure precision by the distance between the top-K views Vp, recommended by our solution using the predicted utility function up(), and the top-K views V* produced by the ideal utility function u*(). Particularly, a utility function up() which selects top-k views closer to V* is considered preferable to a utility function up'() which selects top-k views farther away from V*. More details on the evaluation will be provided later, in Section 4.

Definition 3. (Interactive View Recommendation Problem) Find the solution of the View Utility Function Selection problem such that the computational delay between each subsequent interaction (feedback) with the user is within a time constraint tl.

In each interaction between the system and the user, the user is presented with a set of M views and is expected to provide feedback on the interestingness of each view. The feedback should reflect the user's belief with respect to the interestingness of each view, expressed as a number ranging from 0 to 1, with 0 being the least interesting and 1 being the most interesting. Utilizing the feedback, the system provides better recommendations of views in the subsequent iterations. Furthermore, to enable fluent user interaction, the response time between each subsequent iteration must be within the time constraint tl, which is typically below one second.

3 THE VIEWSEEKER

In this section, we discuss the details of our proposed ViewSeeker, a novel view recommendation tool, which interactively finds the set of views that align best with the user's interest.

Algorithm 1 The ViewSeeker
Require: The raw data set DR and a subset DQ specified by a query
Ensure: The view utility estimator VE
1: Unlabeled view set U ← generateViews(DQ, DR)
2: Labeled view set L ← obtain initial set of view labels
3: VE ← initialize view utility estimator VE using L
4: UE ← initialize uncertainty estimator UE using L
5: loop
6:   Choose one x from U using UE
7:   Solicit the user's label on x
8:   L ← L ∪ {x}
9:   U ← U − {x}
10:  VE ← refine VE using L
11:  UE ← refine UE using L
12:  T ← recommend top views using VE
13:  if the user is satisfied with T or the user wants to stop then
14:    break
15:  end if
16: end loop
17: return the most recent VE

As shown in Algorithm 1, ViewSeeker takes as inputs a dataset DR to use as reference and a subset of the reference data DQ. DQ can be specified by any data specification method, such as an SQL/NoSQL query over DR. The output of ViewSeeker is a view utility estimator, trained with the user's feedback, that predicts the utility of any given view generated from DR based on user preference.

ViewSeeker operates in two phases: an Off-line Initialization and an Interactive View Recommendation.

3.1 Off-line Initialization

In the first phase, ViewSeeker generates all possible views in an offline pre-processing fashion to facilitate the subsequent execution. This phase proceeds in two stages. During the first, for each view vi, ViewSeeker generates two corresponding views: the reference view viR and the target view viT. In particular, viR shows the aggregate values from DR and viT shows the corresponding aggregate values from DQ. Each view is generated by aggregating the data from a specific dataset along a dimension attribute A, using an aggregation function F on a measure attribute M, as discussed above.

To facilitate the computation, ViewSeeker represents each target and reference view as probability distributions, which are stored internally as vectors. Consequently, for each view vi, two probability distributions, P(viT) and P(viR), are obtained for its target view viT and reference view viR, respectively. The conversion from views to probability distributions can be achieved simply through normalization. As illustrated in Equation 5, we normalize each view vi by dividing the aggregate value of each bin in vi by the sum of the aggregated values of all bins in vi, such that the sum of the aggregated values of all bins in vi becomes 1:

P(vi) = ⟨g1/G, g2/G, ..., gb/G⟩    (5)

where P(vi) is the probability distribution after normalization; gi are the individual values in each bin; G = Σ_{i=1}^{b} gi is the sum of the values in all bins; and b is the number of bins in the dimension attribute A.

In the second stage, ViewSeeker produces an internal representation for each view, which is needed for the training of the view utility estimator.
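The control flow of Algorithm 1 can be sketched as follows. This is a schematic rendering under our own naming; the estimator, query-strategy, and user-interaction callables are placeholders, not the actual ViewSeeker code:

```python
def viewseeker(unlabeled, get_label, fit_utility, fit_uncertainty,
               pick_next, recommend, satisfied, max_rounds=100):
    """Schematic rendering of Algorithm 1; all callables are placeholders."""
    # Lines 1-4: seed the labeled set and initialize both estimators.
    labeled = {}
    seed = unlabeled.pop()
    labeled[seed] = get_label(seed)
    ve = fit_utility(labeled)          # view utility estimator VE
    ue = fit_uncertainty(labeled)      # uncertainty estimator UE
    # Lines 5-16: the interactive refinement loop.
    while unlabeled and max_rounds > 0:
        max_rounds -= 1
        x = pick_next(unlabeled, ue)   # line 6: query strategy picks a view
        labeled[x] = get_label(x)      # line 7: solicit the user's label
        unlabeled.remove(x)            # lines 8-9: move x from U to L
        ve = fit_utility(labeled)      # line 10: refine VE
        ue = fit_uncertainty(labeled)  # line 11: refine UE
        top = recommend(ve)            # line 12: current top views T
        if satisfied(top):             # lines 13-15: user may stop
            break
    return ve                          # line 17: most recent VE
```

In the actual system, `fit_utility` and `fit_uncertainty` correspond to the linear and logistic regression estimators described later, and `pick_next` to the uncertainty-sampling query strategy.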
As mentioned above, a variety of different utility functions have been previously proposed. To find the most appropriate one that matches the user's intention, ViewSeeker combines each possible utility function into distinct features of views. Specifically, we noticed that each previously proposed utility function is essentially a combination of one or more "utility components" (e.g., deviation, usability, accuracy). Thus, we incorporate these components as additional features of the views, denoted as utility features. The value of a utility feature of a view vi is the result of the corresponding "utility component" (e.g., deviation) computed on vi.

For illustration purposes, in our current tool, we have implemented eight utility features for each view. The first five utility features are the deviation between the target view and the reference view under different distance measures: Kullback-Leibler divergence (KL-divergence), Earth Mover's Distance (EMD), L1 distance, L2 distance, and the maximum deviation in any individual bin. The last three utility features represent the usability [5], accuracy [5], and p-value [26]. Usability refers to the quality of the visualization in terms of providing the analyst with an understandable, uncluttered representation, which is quantified via the relative bin width metric. Accuracy refers to the ability of the view to accurately capture the characteristics (i.e., distribution) of the analyzed data, which is measured in terms of the Sum of Squared Errors (SSE). The p-value is a statistical term defined as "the probability of obtaining a result equal to or more extreme than what is actually observed, with the given null hypothesis being true" [13]. In the problem of view recommendation, the null hypothesis refers to the reference view, and the extremeness of the results refers to the interestingness of the target views.

It should be noted that, in general, users may customize the utility features, including adding new ones, for personalized analysis. The current set of utility features mentioned above was selected to illustrate the effectiveness of ViewSeeker.

3.2 Interactive View Recommendation

In the second phase, ViewSeeker interactively presents views to the user and then requests user feedback on each presented view; we denote such feedback as a label. In each iteration, ViewSeeker presents M example views (default M = 1) selected from all possible views to the user, and the user is expected to express their interest with respect to each example view with a numeric label that ranges between 0.0 and 1.0, with 1.0 being the most interesting. Example values would be 0.0 (not interesting), 0.7, 0.9, 1.0.

After each iteration, ViewSeeker uses all collected feedback to train a Linear Regression model as the view utility estimator, which predicts for each view the corresponding score produced by the ideal utility function u*(). We choose Linear Regression as the view utility estimator because the task of predicting the utility score of a view can naturally be seen as a regression task.

It can easily be noticed that the effectiveness of our approach depends heavily on the example views being presented to the user for labeling. However, measuring exactly the benefit of each user label before obtaining it is difficult, especially given the fact that the delay between each subsequent iteration cannot exceed the time constraint tl. Fortunately, our problem of finding the most beneficial views to be labeled aligns with the objective of Active Learning techniques [22].

Active Learning is an interactive machine learning framework that achieves accurate classification with minimum human supervision. A typical active learning approach employs a query strategy to sequentially select which unlabeled example (i.e., object) in the database should be presented to the user next for labeling. A query strategy attempts to minimize the labeling costs by selecting the most informative examples for building the classification model. Several query strategies that define the "informativeness" of examples have been proposed in the literature (e.g., [14, 22–24]). Due to the interactive nature of ViewSeeker, we choose to use the most efficient query strategy, named uncertainty sampling [14], as the way to measure the benefit of labeling one view.

The intuition underlying uncertainty sampling is that patterns with high uncertainty are hard to classify, and thus, if the labels of those patterns are obtained, they can boost the accuracy of the classification models. Particularly, in binary classification models (with class labels 0 and 1), the most uncertain example x is the one which can be assigned to either class label z(x) with probability 0.5. Inspired by this idea of uncertainty, also known as least confidence, we adopted the measurement of uncertainty proposed in [14] for binary classification models:

u(lc)(x) = 1 − p(ŷ|x)    (6)

where u(lc)(x) is the uncertainty score under a least confidence measurement of x, and ŷ is the predicted class label of the unlabeled x. Accordingly, after measuring the uncertainty of each unlabeled sample, the unlabeled sample with the highest uncertainty is selected:

x* = argmax_x u(x)    (7)

where u(x) can be any measurement of informativeness over the unlabeled sample x.

Since estimating the uncertainty of a given view requires a probabilistic machine learning model, the view utility estimator (i.e., a non-probabilistic linear regression model) cannot be used to obtain the uncertainty score. To overcome this challenge, we employed a separate Logistic Regression model, trained on the same set of labeled views as the view utility estimator, to serve as the uncertainty estimator, which estimates the uncertainty of each view.

Machine learning models such as the uncertainty estimator must be trained with both positive (i.e., interesting) and negative (i.e., not interesting) views. For this reason, ViewSeeker splits its second phase into two stages as well, such that the first stage addresses the "cold start" issue of the system and the second stage quickly refines the view utility estimator to discover the set of k most interesting views.
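The least-confidence selection of Eqs. 6-7 can be sketched as follows (our own helper names; for a binary model, Eq. 6 reduces to picking the view whose predicted positive-class probability is closest to 0.5):

```python
def least_confidence(p_pos):
    """Eq. 6: u_lc(x) = 1 - p(y_hat|x), where p(y_hat|x) is the probability
    of the *predicted* label, i.e. max(p_pos, 1 - p_pos) in the binary case."""
    return 1.0 - max(p_pos, 1.0 - p_pos)

def most_uncertain(prob_by_view):
    """Eq. 7: select the unlabeled view with the highest uncertainty score."""
    return max(prob_by_view, key=lambda v: least_confidence(prob_by_view[v]))
```

Here `prob_by_view` would hold, for each unlabeled view, the positive-class probability output by the logistic-regression uncertainty estimator.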
The "cold start" issue is basically the acquisition of the first positive view. To facilitate this process, ViewSeeker first selects views ranked highest according to each utility feature (e.g., deviation, accuracy, usability, p-value). Each utility feature is then considered in a sequential manner, such that in each iteration the set of M views ranked highest according to the utility feature currently under consideration is chosen from all possible views and presented to the user for feedback (i.e., labels). In the case where no positive or negative feedback has been received after visiting all dimensions, ViewSeeker switches to random sampling for the subsequent interactions with the user.

In the second stage of the second phase, ViewSeeker uses the uncertainty estimator to choose the most informative views to be presented to the user for feedback in each iteration. Specifically, when the output likelihood p(ŷ|x) of the uncertainty estimator for a view is closest to 50%, this view is considered to be the most uncertain, and thus should be preferred over views that are less uncertain. During each iteration, ViewSeeker samples as many distinct views as possible within the given time constraint tl and then presents the M views with the highest uncertainty scores to the user for feedback. After each iteration, the newly acquired label(s) (i.e., feedback) are merged with all previously obtained labels for use as the training data for the refinement of both the view utility estimator and the uncertainty estimator.

Table 1: Testbed Parameters

Total number of records: 1.0 × 10^5 (DIAB), 1.0 × 10^6 (SYN)
Cardinality ratio of records in DQ: 0.5%
Number of dimension attributes (A): 7 (DIAB), 5 (SYN)
Number of distinct values in A: variable (DIAB), 3 and 4 (SYN)
Number of measure attributes (M): 8 (DIAB), 5 (SYN)
Number of aggregation functions: 5
Number of view utility features: 8
Utility estimator: Linear regressor
Number of views presented per iteration: 1
Performance measurement: Top-k precision
The number of views to recommend (k): 5, 10, 15, 20, 25, 30
Optimization partial data ratio α: 10%
Optimization time limit per iteration: 1 second
Optimization performance measurement: Top-k precision, Running time

Table 2: Simulated Ideal Utility Functions (#: involved utility features and weights)

1: 1.0 * KL
2: 1.0 * EMD
3: 1.0 * MAX_DIFF
4: 0.5 * EMD + 0.5 * KL
5: 0.5 * EMD + 0.5 * L2
6: 0.5 * EMD + 0.5 * p-value
7: 0.3 * EMD + 0.3 * KL + 0.4 * MAX_DIFF
8: 0.3 * EMD + 0.3 * L2 + 0.4 * MAX_DIFF
9: 0.3 * EMD + 0.3 * p-value + 0.4 * MAX_DIFF
10: 0.3 * EMD + 0.3 * KL + 0.4 * Usability
11: 0.3 * EMD + 0.3 * KL + 0.4 * Accuracy
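The Stage-1 cold-start procedure described above can be sketched as follows. This is a simplification with hypothetical names; in the real system the per-feature rankings come from the precomputed utility features, and the loop stops as soon as a positive label arrives:

```python
def cold_start_proposals(ranked_by_feature):
    """Visit the utility features in sequence and propose, for each, its
    highest-ranked view that has not been shown yet (sketch of Stage 1).
    The caller falls back to random sampling once nothing new remains."""
    seen, proposals = set(), []
    progressed = True
    while progressed:
        progressed = False
        for feature, ranking in ranked_by_feature.items():
            for view in ranking:          # highest-ranked view first
                if view not in seen:
                    seen.add(view)
                    proposals.append((feature, view))
                    progressed = True
                    break
    return proposals
```

For example, with rankings `{"EMD": ["v1", "v2"], "KL": ["v1", "v3"]}` the proposal order is v1 (top of EMD), v3 (top unseen view of KL), then v2.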
After each refinement, the system shows the user the set of top-k views ranked highest according to the utility score produced by the most recent view utility estimator. Once the user is satisfied with the set of recommended views, the entire exploration process terminates, and the most recently trained view utility estimator up() is selected as the estimation of u*().

3.3 Optimizations

In this section, we introduce the key optimizations built into ViewSeeker to speed up the interactive recommendation process while minimizing the reduction in the effectiveness of the recommendation.

Recall that in the first phase, ViewSeeker converts each basic utility component into utility features and embeds them into the vector representation of the views. During this conversion, each utility feature needs to be calculated for all possible views, which is time-consuming. As the majority of the views are typically not of the user's interest, it is unnecessary to fully compute these view utilities over the entire database. Instead, we propose an optimization that aims to compute only the set of utilities for views that are more likely to be of interest to the user.

Specifically, our optimization works as follows. During the first phase, we compute the set of utility features for each possible view with only α percent of the data, where α is a pre-defined ratio, which can be tuned based on the data size, available system resources, and the time constraint tl. In particular, ViewSeeker uniformly samples α percent of the data from the underlying database to produce a "rough" utility score for each utility feature of each view.

Later on, during the second phase, ViewSeeker incrementally refines the utility score of each view with the entire set of data whenever there is spare computing power available between user labeling prompts, while ensuring the time constraint tl is obeyed. In other words, the main idea is to quickly compute a set of initial "rough" values for the utility features of each view to enable subsequent interactions, and then to incrementally compute the accurate utility scores for each view while leveraging the user interaction time. This allows ViewSeeker to hide the necessary computation and, in turn, makes the delays transparent to the user.

To utilize spare computing power efficiently, it is important to determine the order of the incremental view computation. To do so, ViewSeeker uses the current view utility estimator to rank the views, and the views ranked highly have higher priority in computing the accurate utility features. Effectively, these optimizations allow ViewSeeker to reduce unnecessary computation by pruning out the calculations for views that are less promising.

4 EXPERIMENTAL TESTBED

We evaluated the benefit of using ViewSeeker through a simulated user study, and measured both efficiency and effectiveness. Here we present the details of our testbed (Table 1).

Setup. We built a ViewSeeker platform in Python. Our experiments were performed on a Core i5 server with 8GB of RAM.

Data Sets. Two datasets were used in the experiments: DIAB and SYN. DIAB is a categorical dataset of diabetic patients [1]. It contains 100 thousand records. We removed the attributes that have a large amount of missing data or are very sparse. After preprocessing, the dataset had 7 dimension attributes A and 8 measure attributes M, and a total of 280 distinct views were generated.

SYN is a synthetic dataset with 1 million numerical records that contains 5 dimension attributes, 5 measure attributes, and 2 bin configurations (i.e., we create views with 3 bins or 4 bins). The values of the attributes of each record are uniformly distributed. The total view space for the SYN dataset is 250 distinct views.

Simulated User Study. Initially, we generated the view utility features in two steps: 1) we created a hypercube in the recording space to represent DQ, which is a subset of the data specified by a query; and 2) we generated view utility features based on the different utility functions mentioned in Section 3. Afterward, we simulated different user utility functions u*(), each of which is a weighted sum of one or more individual utility functions. For each presented view vi, we simulated the user's belief with respect to the interestingness of a view through the normalized utility score produced by u*(vi), such that u*(vi) = 0.7 indicates the interestingness of view vi is about 70% of the maximum.

Performance. We evaluated the performance of our system in terms of recommendation precision. In particular, we showed the effectiveness of our proposed approach by measuring the number of example views needed to reach 100% precision. Here we define the precision as the size of the intersection between the top-k views recommended by ViewSeeker and the top-k views recommended by u*(). For two sets of top-k views Vp and V* produced by ViewSeeker and u*(), respectively, the precision is calculated as |Vp ∩ V*| / k.

Figure 3: Recommendation precision for the DIAB dataset with different ideal utility functions u*(): (a) single-component u*(), (b) two-component u*(), (c) three-component u*().

Figure 4: Recommendation precision for the SYN dataset with different ideal utility functions u*(): (a) single-component u*(), (b) two-component u*(), (c) three-component u*().
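The precision metric above, and the α-sampling of Section 3.3, can be sketched as follows (our own function names; the uniform row filter is a simplification of ViewSeeker's actual sampling):

```python
import random

def topk_precision(recommended, ideal):
    """Section 4 metric: |Vp intersect V*| / k, with k = |ideal|."""
    return len(set(recommended) & set(ideal)) / len(ideal)

def alpha_sample(rows, alpha, seed=0):
    """Uniformly keep roughly alpha percent of the rows, from which the
    'rough' utility features of Section 3.3 would be computed (sketch)."""
    rnd = random.Random(seed)
    return [r for r in rows if rnd.random() < alpha / 100.0]
```

For instance, with k = 3, recommending {a, b, c} against an ideal set {a, c, d} yields a precision of 2/3.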
In Figures 3b and 4b, we repeated Simulated Ideal Utility Functions We evaluated the effectiveness the evaluation for composited ideal utility functions that consists of and efficiency using 11 diverse ideal utility functions u ∗ () that in- two utility components (i.e., results are averaged over ideal Utility cluded 3 single component utility functions and 8 multi-component, Functions 4-6). In Figures 3c and 4c, we repeated the evaluation for composite utility functions (Table 2). We chose the components in composite ideal utility functions with three utility components (i.e., multi-component u ∗ () carefully such that they represent different results are averaged over ideal Utility Functions 7-11). characteristics of the view. For example, EMD focuses on devia- From these results, we can observe that our proposed ViewSeeker tions across bins, KL-divergence indicates the sum of deviation in is extremely effective in discovering the set of ideal top-k views: individual bins, Usability represents the visual quality of a view, etc. for k ranging from 5-30, on average only 7-16 labels were required before the ViewSeeker reached a precision of 100% for both DIAB and SYN datasets. Clearly, this indicates that only a small amount 5 EXPERIMENTAL RESULTS of user effort is needed before a satisfactory set of results can be We evaluated both the effectiveness of ViewSeeker and its optimiza- obtained by the ViewSeeker. tions by contacting three experiments. Experiment 2: We compared the top-k recommended views by ViewSeeker with the top-k recommended views by the baselines in 5.1 Evaluation of Effectiveness terms of the maximum achievable recommendation precision. We We contacted two experiments to evaluate the effectiveness of the use the 8 individual utility features (e.g., KL, EMD, L1, L2, etc.) as ViewSeeker. The first measured the user effort, whereas the second the baselines. 
Figure 5 shows the result for ideal Utility Function 11 (i.e., u*() = 0.3*EMD + 0.3*KL + 0.4*Accuracy) on the DIAB dataset. We observe that ViewSeeker achieved a 3X improvement in recommendation precision over the best-performing baseline (i.e., EMD). This indicates that a single fixed utility feature cannot capture a complex ideal utility function, such as the one above, which best captures the user's interest.

Figure 5: Precision comparison with individual utility features.

5.2 Evaluation of Optimizations
We evaluated the effectiveness of our optimization techniques by comparing the recommendation precision and runtime of the optimizations-enabled ViewSeeker against the optimizations-disabled ViewSeeker (i.e., the baseline model).

When comparing the recommendation precision, in order to eliminate the non-determinism in selecting the k-th view of the top-k, we introduced the concept of utility distance (UD), which is defined as:

UD = ( Σ_{v_i ∈ V*} u*(v_i) − Σ_{v_i ∈ V_p} u*(v_i) ) / k        (8)

where V* are the top-k views recommended by the ideal utility function u*(), and V_p are the top-k views recommended by ViewSeeker.

To explain the top-k non-determinism that motivated UD, note that the views directly after the k-th view may have utility very close, or even identical, to that of the k-th view. In such cases, changing the order among these close views should affect the precision little, if at all. UD has this desired property because its evaluation focuses on utility distance instead of the exact membership of the top-k views.

Figures 6 and 7 show the number of feedbacks and the runtime, respectively, needed for both models to reach UD = 0 on the DIAB dataset. On average, the model with optimizations achieved a 43% reduction in running time while requiring only 19% more user labeling effort.

Figure 6: Recommendation precision (UD = 0) with optimization for the DIAB dataset with different ideal utility functions u*(): (a) single-component, (b) two-component, and (c) three-component u*().
Figure 7: System runtime (RT) with optimization for the DIAB dataset with different ideal utility functions u*(): (a) single-component, (b) two-component, and (c) three-component u*().

These results confirm that our optimization methods are effective in reducing computation time by pruning the calculations for less promising views. The increase in the required labeling effort arises because the view utility features computed on partial data are only estimates and may deviate from the true features, which hinders the learning of the ideal utility function.

6 RELATED WORKS
In this section, we review the works that are most relevant to ours.

View Recommendation techniques automatically generate all possible views of the data and recommend the top-k interesting views according to some utility function (e.g., [5, 10, 16, 18, 19, 21, 27-29]). A key difference among these works is the proposed utility functions. Recent work placed additional constraints, e.g., an upper bound on the number of views to be explored and an execution time limit when computing the recommended views [10]. The closest work to ours is the use of generic priors (i.e., general knowledge of how people associate views with different datasets and exploration tasks) to train a general machine learning model that predicts the utility score of any given view [16]. The generic priors were obtained by hiring a large number (i.e., 100) of human annotators to annotate multiple real-world datasets. The key difference between our work and all prior work is that previous works use predefined view utility functions and do not discover the utility function that best matches an individual user's intention and exploration task.

Interactive Visualization Tools have been studied extensively over the past few years [3, 7, 9, 11, 12, 15, 17, 25]. Unlike visualization recommendation tools such as ViewSeeker, which recommend visualizations automatically by searching through the entire view space, traditional interactive visualization tools require the user to manually specify the views to be generated. Recently, a few interactive visualization tools have attempted to automate part of the data analysis and visualization process. For instance, Profiler automatically helps analysts detect anomalies in the data [11]. But, unlike our approach, Profiler is not capable of providing general recommendations for arbitrary group-by queries. Another recent example is VizDeck [12], which generates all possible 2D views of a given dataset and allows the user to manually control (e.g., reorder and pin) these views through a dashboard, rather than using a utility function. The work in [3] is the closest to ours in the sense that it also uses user feedback to steer the view exploration. However, that user feedback is only used to train a classifier, which is not capable of estimating the ideal utility function and obtaining the top-k views.

Data Exploration techniques that aim to efficiently extract knowledge from data [20] are complementary to our work. In particular, example-driven data exploration approaches [4, 8] assume minimal prior knowledge of the data and share the same underlying approach as ViewSeeker. These works aim to iteratively construct the exploratory query through user interactions, just as ViewSeeker iteratively discovers the utility function using user feedback. ViewSeeker is well suited to such settings and can enhance example-driven data exploration by creating visualizations that illustrate interesting patterns during the construction of the exploratory queries.

7 CONCLUSION
In this work, we addressed the challenge in visual analytics tools of recommending the set of views most preferred by the user during an exploration task. Our contribution, ViewSeeker, is a novel, interactive view recommendation tool that offers a solution to the fundamental challenges of: 1) providing effective results with minimum user effort, 2) enabling efficient and scalable computation, and 3) requiring no special expertise from the users.

The crux of ViewSeeker is the discovery of the utility function, which is used to select the views that best match the user's exploration task. ViewSeeker employs an active learning-based predictive model and effectively learns the user's preferred views through simple interactions between the user and the system. To enable fluent user interactions, we proposed a set of optimization techniques that significantly improved the runtime efficiency of ViewSeeker. Our experimental results, using both synthetic and real-world datasets, showed that ViewSeeker outperforms the alternative baseline approaches by a significant margin.

In the future, we plan to conduct a user study to validate the recommendation effectiveness of ViewSeeker, and to extend it to support more visualization types, such as scatter plots and line charts.
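As a sanity check on the utility distance of Equation 8 in Section 5.2, UD can be computed directly from the ideal utility scores. The scores below are hypothetical: the predicted set swaps the k-th view for a near-tie, illustrating why UD stays near zero in exactly the situation where exact top-k precision would drop.

```python
def utility_distance(v_ideal, v_pred, u_star, k):
    """UD = (Σ u*(v) over ideal top-k − Σ u*(v) over predicted top-k) / k (Eq. 8)."""
    return (sum(u_star(v) for v in v_ideal) - sum(u_star(v) for v in v_pred)) / k

# Hypothetical ideal utilities; v3 and v4 are a near-tie at rank k.
u_star = {"v1": 0.9, "v2": 0.8, "v3": 0.70, "v4": 0.69}.get
V_ideal = ["v1", "v2", "v3"]
V_pred  = ["v1", "v2", "v4"]

print(utility_distance(V_ideal, V_pred, u_star, k=3))  # (2.40 - 2.39) / 3, close to 0
```

Exact top-k precision for this example would be 2/3 despite the two sets being nearly equivalent in utility, whereas UD is about 0.003, which is the behavior that motivated the measure.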
Acknowledgments. We would like to thank the anonymous reviewers for their helpful comments. This work was supported, in part, by NIH under award U01HL137159. This paper does not represent the views of NIH.

REFERENCES
[1] Pima Indians Diabetes dataset. https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
[2] Tableau Public. http://public.tableau.com.
[3] Michael Behrisch, Fatih Korkmaz, Lin Shao, and Tobias Schreck. 2014. Feedback-driven interactive exploration of large multidimensional data supported by visual classifier. In IEEE VAST.
[4] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. 2014. Explore-by-example: an automatic query steering framework for interactive data exploration. In ACM SIGMOD.
[5] Humaira Ehsan, Mohamed A. Sharaf, and Panos K. Chrysanthis. 2016. MuVE: Efficient Multi-Objective View Recommendation for Visual Data Exploration. In IEEE ICDE.
[6] H. Ehsan, M. A. Sharaf, and P. K. Chrysanthis. 2018. Efficient Recommendation of Aggregate Data Visualizations. IEEE TKDE 30, 2 (2018), 263-277.
[7] D. Fisher. 2007. Hotmap: Looking at Geographic Attention. IEEE TVCG 13, 6 (2007), 1184-1191.
[8] Xiaoyu Ge, Yanbing Xue, Zhipeng Luo, Mohamed A. Sharaf, and Panos K. Chrysanthis. 2016. REQUEST: A Scalable Framework for Interactive Construction of Exploratory Queries. In IEEE International Conference on Big Data.
[9] Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, and Jonathan Goldberg-Kidon. 2010. Google fusion tables: web-centered data management and collaboration. In ACM SIGMOD.
[10] Ibrahim A. Ibrahim, Abdullah M. Albarrak, and Xue Li. 2017. Constrained Recommendations for Query Visualizations. Knowl. Inf. Syst. 51, 2 (2017), 499-529.
[11] Sean Kandel, Ravi Parikh, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. Profiler: integrated statistical analysis and visualization for data quality assessment. In ACM AVI.
[12] Alicia Key, Bill Howe, Daniel Perry, and Cecilia R. Aragon. 2012. VizDeck: self-organizing dashboards for visual analytics. In ACM SIGMOD.
[13] Martin Krzywinski and Naomi Altman. 2013. Significance, P values and t-tests. Nature Methods.
[14] David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In ACM SIGIR.
[15] M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande, J. Myllymaki, and K. Wenger. 1997. DEVise: Integrated Querying and Visual Exploration of Large Datasets. In ACM SIGMOD.
[16] Yuyu Luo, Xuedi Qin, Nan Tang, and Guoliang Li. 2018. DeepEye: Towards Automatic Data Visualization. In IEEE ICDE.
[17] J. Mackinlay, P. Hanrahan, and C. Stolte. 2007. Show Me: Automatic Presentation for Visual Analysis. IEEE TVCG 13, 6 (2007), 1137-1144.
[18] Rischan Mafrur, Mohamed A. Sharaf, and Hina A. Khan. 2018. DiVE: Diversifying View Recommendation for Visual Data Exploration. In ACM CIKM.
[19] Dominik Moritz, Chenglong Wang, Greg L. Nelson, Halden Lin, Adam M. Smith, Bill Howe, and Jeffrey Heer. 2019. Formalizing visualization design knowledge as constraints: actionable and extensible models in Draco. IEEE TVCG 25, 1 (2019), 438-448.
[20] Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, and Themis Palpanas. 2017. New Trends on Exploratory Methods for Data Analytics. In VLDB. 1977-1981.
[21] Belgin Mutlu, Eduardo Veas, and Christoph Trattner. 2016. VizRec: Recommending personalized visualizations. ACM TiiS 6, 4 (2016), 31.
[22] Burr Settles. 2009. Active learning literature survey. Technical Report, University of Wisconsin-Madison.
[23] Burr Settles and Mark Craven. 2008. An Analysis of Active Learning Strategies for Sequence Labeling Tasks. In ACL EMNLP.
[24] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by Committee. In ACM COLT.
[25] Chris Stolte and Pat Hanrahan. 2000. Polaris: A System for Query, Analysis and Visualization of Multi-Dimensional Relational Databases. In IEEE INFOVIS.
[26] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. 2017. Extracting Top-K Insights from Multi-dimensional Data. In ACM SIGMOD.
[27] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran, and Neoklis Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182-2193.
[28] Kanit Wongsuphasawat, Dominik Moritz, Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer. 2016. Voyager: Exploratory analysis via faceted browsing of visualization recommendations. IEEE TVCG 1 (2016).
[29] Kanit Wongsuphasawat, Zening Qu, Dominik Moritz, Riley Chang, Felix Ouk, Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer. 2017. Voyager 2: Augmenting visual analysis with partial view specifications. In ACM CHI. 2648-2659.