ViewSeeker: An Interactive View Recommendation Tool
        Xiaozhong Zhang                                 Xiaoyu Ge                     Panos K. Chrysanthis                 Mohamed A. Sharaf
       University of Pittsburgh                  University of Pittsburgh             University of Pittsburgh            University of Queensland
        xiaozhong@pitt.edu                        xiaoyu@cs.pitt.edu                    panos@cs.pitt.edu                   m.sharaf@uq.edu.au
ABSTRACT
View recommendation has emerged as a powerful tool to assist data
analysts in exploring and understanding big data. Existing view
recommendation approaches proposed a variety of utility functions
in selecting useful views. Even though each utility function might
be suitable for specific scenarios, identifying the most appropriate
ones along with their tunable parameters, which represent the user’s
intention during an exploration, is a challenge for both expert and
non-expert users. This paper presents the first attempt towards inter-
active view recommendation by automatically discovering the most
appropriate view utility functions in an exploration based on the
user’s preferences. In particular, our proposed ViewSeeker uses a
novel active learning method to discover the view utility function by
interactively refining the set of k view recommendations. The effec-
                                                                                      Figure 1: Example of a view providing an insight about the per-
tiveness and efficiency of ViewSeeker was experimentally evaluated
                                                                                      formance of an NBA team.
using both synthetic and real data sets.


1    INTRODUCTION                                                                     scenarios, identifying the most appropriate ones and their tunable pa-
The ubiquitously available information sources and the advance-                       rameters, which represent the user’s intention during an exploration,
ments in data storage and acquisition techniques have led to an                       is still a challenge for both expert and non-expert users.
aggressive increase in the data volumes available for data analysis                      Motivated by the need for supporting visualization recommen-
tasks. One major challenge in utilizing these abundantly available                    dation, which is tailored to the user’s exploration task, we propose
data is discovering insights from them effectively and efficiently.                   an interactive view recommendation tool, called ViewSeeker, that
Examples of an “insight” include the structure, patterns, and causal                  efficiently assists the user to explore visually large, high-dimensional
relationships. To explore these massive and structurally complicated                  datasets. The main idea of the ViewSeeker is to automatically iden-
datasets, data analysts often utilize visual data analysis tools such as              tify the most appropriate utility function based on the user’s feedback
Tableau [2] and Voyager [28]. However, the effectiveness of these                     on selected views. In particular, ViewSeeker iteratively presents a
tools depends on the user’s expertise and experience. Coming up                       set of views to the user, and the user is asked to provide simple
with a visualization that shows interesting trends/patterns is a non-                 feedback to each of the presented views. Utilizing such feedback,
trivial issue. Consider, for example, a view comparing the player                     ViewSeeker learns the most appropriate utility function by employ-
3-point attempt rate of a selected NBA team with that of all teams                    ing a novel active learning method. In each iteration, ViewSeeker
in the league (Figure 1), which could explain why the selected team                   effectively predicts and refines the view utility function and identify
on the right (black) outperformed the league average on the left                      those views, from all possible views, which have high values in
(gray) and won a championship. The analyst needs to examine the                       their utility functions. ViewSeeker ensures a sub-second runtime
relationships among various attributes and consider various aggre-                    response time for each subsequent iteration, by employing multiple
gate functions before any useful visualizations can be discovered.                    optimization techniques such as pruning, sampling, and ranking.
This approach is typically ad-hoc, labor-intensive, and not scalable,                    To verify the effectiveness of our proposed interactive view rec-
especially for high-dimensional databases.                                            ommendation tool, ViewSeeker, we implemented a prototype system
   To address the shortcoming of the current visual analysis tools,                   and experimentally evaluated it using both a synthetic dataset and
several methods for recommending visualizations have recently been                    a real-world dataset of diabetic patients [1]. Our evaluation results
proposed (e.g., [5, 6, 11, 12, 16, 18, 27]). These methods automati-                  have confirmed the effectiveness of ViewSeeker, such that on aver-
cally generate all possible views of data, and recommend the top-k                    age only 7-16 user answers (feedback) are required to obtain a set of
interesting views, according to some utility function (e.g., deviations,              highly accurate top-k view recommendations.
data variance, usability) that measures the interestingness of data.                     Specifically, the contributions of this paper are the following:
Even though each utility function might be suitable for specific
                                                                                         (1) Designing an interactive view recommendation tool called
                                                                                             ViewSeeker that iteratively discovers the most appropriate
© 2019 Copyright held by the owner/author(s). Published in the Workshop Proceedings          view utility function and refines the recommended views (i.e.,
of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on
CEUR-WS.org.                                                                                 histograms or bar charts) with the feedback that the user
                                                                                             provided.
BigVis, 2019, Lisbon, Portugal                                Xiaozhong Zhang, Xiaoyu Ge, Panos K. Chrysanthis, and Mohamed A. Sharaf


    (2) Proposing a suite of optimization techniques that improve the
        efficiency of the system, while minimizes the impact on the
        quality of the recommended views.
    (3) Implementing a prototype system of the proposed ViewSeeker,
        and then verified the effectiveness of our proposed approach
        on both synthetic and real-world datasets.
Outline The rest of the paper is structured as follows. Section 2
introduces our problem definitions. Section 3 presents our proposed
approach. Section 4 and 5 describe our experimental testbed and
results. Section 6 discusses related work and Section 7 concludes
the paper.

2     PROBLEM FORMULATION
In this section, we present the necessary background details of our
work. Specifically, we discuss how views can be constructed through
SQL queries, and then explain how the interestingness of a view                   Figure 2: A target view and its corresponding reference view.
may be captured through a pre-defined utility function. Afterward,
we formally present our proposed problem.
                                                                                 (a, m, f ) used by the target view. An example of a target view with
2.1     Views & Data Visualization                                               its reference view is illustrated in Figure 2.
                                                                                     Deviation measures the difference between the target view and
To begin, we start by describing a view (i.e., histogram or bar chart)           the reference view with an underlying assumption that the greater the
in the context of structural databases. A view vi essentially repre-             distance between the target view and the reference view, the higher
sents an SQL query with a group-by clause over a database D. Under               the utility is. In the case of histograms or bar charts, measuring the
the typical multi-dimensional data models, data can be modeled as                distance between two views vi and v j essentially equals measuring
a set of measure attributes M = {m 1 , m 2 , m 3 , ...} and a set of di-         the distance between two normalized probability distributions P (vi )
mension attributes A = {a 1 , a 2 , a 3 , ...}. The measure attributes (e.g.,    and P (v j ). Formally, the utility score u (vi ) of a view vi computed
number of items sold) is the set of attributes that contain measurable           from deviation can be defined as:
value and can be aggregated. The dimensional attributes (e.g., brand,
year, color, size) is the set of attributes on which measure attributes                               u (vi ) = DT (P (viT ), P (viR ))               (2)
are viewed. To formulate an SQL query with a group-by clause, we                 where DT is the distance function that measures the distance between
need to have some a set of aggregation functions F = { f 1 , f 2 , f 3 , ...}.   two distributions (e.g., Euclidean, Earth Movers Distance).
Thus, we can represent each view vi as a triple (a, m, f ), such that               Consequently, the typical view recommendation problem can be
one dimension attribute a is applied to one aggregation function f               defined as follows:
on the corresponding measure attribute m. Consequently, the View
Space (VS), i.e., the total number of possible views is:                         Definition 1. (View Recommendation Problem) Given a database D,
                                                                                 a user-specified query Q, a set of results R produced by Q, a utility
                        V S = 2 × |A| × |M | × |F |                       (1)    function u (), and the size of the preferred view recommendations K.
   Clearly, VS can be large, especially with high-dimensional data.              Find the top-k views v 1 , v 2 , ..., vk constructed from R that have the
In order to recommend the set of k most interesting views from a                 highest utilities according to u () among all possible views.
large number of views, utility scores are required to rank all the
views. To compute such utility scores, existing literatures have pro-            2.2    Problem Settings
posed a large number of utility functions, some commonly used ones
                                                                                 The above definition of a typical view recommendation problem
includes deviation [27], accuracy [5], usability [5] and p-value [26].
                                                                                 assumes that the utility function is given [19, 21, 28, 29]. In this
Furthermore, these utility functions also contain their own param-
                                                                                 section, we will formalize our problem of interactive view recom-
eters, and any of these functions can further be combined linearly
                                                                                 mendation in which the composition of the utility function is not
with others to form composited utility functions, thus leading to an
                                                                                 given but is discovered.
enormous search space for the utility functions.
   We use the family of deviation-based utility functions as an ex-              Definition 2. (View Utility Function Selection Problem) Consider a
ample to further explain the potential search space of the utility               d-dimensional database D. Further, consider a user-specified query
function and illustrate how the interestingness of a view may be                 Q, a set of results R produced by Q, a set of n possible utility function
measured. For clarity, we call each original view a target view viT ,            U = {u 1 , u 2 , ..., un }, and the size of the preferred view recommen-
which is represented as a triple (a, m, f ) applied to a subset of the           dations K. Find the utility function u p (), which produces top-K
                                                                                                       p p        p
data D Q that is produced by a given user query Q. In order to define            views V p = {v 1 , v 2 , ..., vk }, constructed from R based on user’s
the deviation, we create a helper view called the reference view viR                                          p
                                                                                 feedback, such that u () can be any arbitrary combination of the
for each target view. The reference view visualizes the results of               utility functions in U and most accurately captures the user’s ideal
grouping the data in the whole database D with the same set of triple            utility function u ∗ ().
ViewSeeker: An Interactive View Recommendation Tool                                                                   BigVis, 2019, Lisbon, Portugal


   Clearly, view recommendation with dynamic view utility function               Algorithm 1 The ViewSeeker
selection, is a more challenging problem compared to traditional                 Require: The raw data set D R and a subset D Q specified by a query
view recommendation, given that it has a much higher search space                Ensure: The view utility estimator V E
complexity, because it combines the traditional View Space (Eq. 1)                1: Unlabeled view set U ← дener at eV iews (D Q , D R )
with the search space of the components of the utility functions.                 2: Labeled view set L ← obtain initial set of view labels
                                                                                  3: V E ← initialize view utility estimator V E using L
                U tilityV S = V S × |u 1 ()| × · · · × |un ()|            (3)     4: U E ← initialize uncertainty estimator U E using L
                                                                                  5: loop
where ui () is a utility function, i = 1, .., n.                                  6:    Choose one x from U using U E
   That is, in the context of our problem, we are expanding the                   7:    Solicit user’s label on x
representation of a view vi from a triple (a, m, f ) to be a tuple                8:    L ← L ∪ {x }
(a, m, f , u 1 (), ..., un ()), and central to our solution is discovering the    9:    U ← U − {x }
ideal utility function u ∗ (), interactively through user feedback:              10:    V E ← refine V E using L
                                                                                 11:    U E ← refine U E using L
                     u ∗ () = β 1u 1 () + · · · + βn un ()                (4)    12:    T ← recommend top views using V E
                                                                                 13:    if the user is satisfied with T or the user wants to stop then
where βi is the weights assigned to the corresponding possible utility           14:        Break
function ui , i = 1, ..., n. It can be seen from this equation that u ∗ ()       15:    end if
can be any linear combination of the individual utility functions.               16: end loop
For instance, u ∗ () can be mapped to a single utility function ui , in          17: Return the most recent V E
this case βi = 1 and all other β = 0; or u ∗ () can be a combination
of multiple utility functions, where a set of corresponding β of the
utility functions in u ∗ () sum to 1 and the remaining β are set to 0.           ViewSeeker operates in two phases: an Off-line Initialization, and
   Since our objective is to predict the user’s most preferred utility           Interactive View Recommendation.
function u ∗ () with high precision, we can measure precision by the
distance between the top-K views V p , recommended by our solution               3.1    Off-line Initialization
using the predicted utility function u p (), and those top-K views V ∗           In the first phase, ViewSeeker generates all possible views in an
produced by the ideal utility function u ∗ (). Particularly, a utility           offline pre-processing fashion to facilitate the subsequent execution.
function u p (), which selects top-k views closer to V ∗ , is considered         This phase proceeds in two stages. During the first, for each view vi ,
to be more preferred than a utility function u p (), which selects top-k
                                                 ′
                                                                                 ViewSeeker generates two corresponding views: the reference view
views farther away from V . More details on the evaluation will be
                                ∗
                                                                                 viR and the target view viT . In particular, viR shows the aggregate
provided later, in Section 4.                                                    values from D R and viT shows the corresponding aggregate values
                                                                                 from D Q . Each view is generated by aggregating the data from a
Definition 3. (Interactive View Recommendation Problem) Find
                                                                                 specific dataset along a dimension attribute A, using an aggregation
the solution of the View Utility Function Selection problem such
                                                                                 function F on a measure attribute M, as discussed above.
as the computational delay between each subsequent interactions
                                                                                    To facilitate the computation, ViewSeeker represents each target
(feedback) with the user is within a time constraint tl.
                                                                                 and reference view as probability distributions, which are stored in-
   In each interaction between the system and the user, the user is              ternally as vectors. Consequently, for each view vi , two probability
presented with a set of M views and expected to provide feedback                 distributions, P (viT ) and P (viR ), are obtained for its target view viT
on the interestingness of each view. The feedback should reflect                 and reference view viR , respectively. The conversion from views to
the user’s belief with respect to the interestingness of each view,              probability distributions can be achieved simply through normaliza-
which would be a number ranging from 0-1, with 0 being the less                  tions. As illustrated below, in Equation 5, we normalize each view
interesting and 1 being the most interesting. Utilizing the feedback             vi by individually dividing the aggregate value of each bin in vi by
the system would provide better recommendations of views in the                  the sum of the aggregated values of all bins in vi , such that the sum
subsequent iterations. Furthermore, to enable fluent user interaction,           of aggregated values of all bins in vi would become 1.
the response time between each subsequent iteration must be within                                                   д1 д2      д
the time constraint tl, which is typically below one second.                                              P (vi ) = ⟨ , , ..., b ⟩                      (5)
                                                                                                                     G G         G
                                                                                 where P (vi ) is the probability distribution after normalization; дi are
3    THE VIEWSEEKER                                                              individual values in each bin; G = bi=1 дi is the sum of the values
                                                                                                                         P
In this section, we discuss the details of our proposed ViewSeeker, a            in all bins; and b is the number of bins in the dimension attribute A.
novel view recommendation tool, which interactively finds the set                   In the second stage, ViewSeeker would produce an internal rep-
of views that align best with the user’s interest.                               resentation for each view, which is needed for the training of the
   As shown in Algorithm 1, the ViewSeeker takes a dataset D R to                view utility estimator. As mentioned above, a variety of different
use as reference and a subset of the reference data D Q as inputs.               utility functions have been previously proposed. To find the most
D Q can be specified by any data specification method such as an                 appropriate one that matches the user’s intention, ViewSeeker com-
SQL/NoSQL query over D R . The output of ViewSeeker is a view                    bines each possible utility function into distinct features of views.
utility estimator trained with user’s feedback that predicts the utility         Specifically, we noticed that each previously proposed utility func-
of any given view generated from D R based on user preference.                   tion is essentially a combination of one or more “utility components”
BigVis, 2019, Lisbon, Portugal                              Xiaozhong Zhang, Xiaoyu Ge, Panos K. Chrysanthis, and Mohamed A. Sharaf


(e.g., deviations, usability, accuracy). Thus, we incorporate these           Several query strategies that define the "informativeness" of exam-
components as additional features of the views denoted as utility             ples have been proposed in the literatures (e.g., [14, 22–24]). Due
features. The value of a utility feature of a view vi is the result of the    to the interactive nature of ViewSeeker, we choose to use the most
corresponding “utility component” (e.g., deviation) computed on vi .          efficient query strategies, named uncertainty sampling [14], as the
    For illustration purposes, in our current tool, we have imple-            way to measure the benefit of labeling one view.
mented eight utility features for each view. The first five utility              The intuition underlying uncertainty sampling is that patterns
features are the deviation between target view and reference view             with high uncertainty are hard to classify, and thus if the labels
with different distance measures: Kullback-Leibler divergence (KL-            of those patterns are obtained, they can boost the accuracy of the
divergence), Earth Mover Distance (EMD), L1 distance, L2 distance             classification models. Particularly, in binary classification models
and the maximum deviation in any individual bin. The last three util-         (with class labels 0 and 1), the most uncertain example x is the one
ity features represent the usability [5], accuracy [5], and p-value [26].     which can be assigned to either class label z(x) with probability 0.5.
Usability refers to the quality of the visualization in terms of provid-         Inspired by such idea of uncertainty, also known as least confi-
ing the analyst with an understandable, uncluttered representation,           dence, we adopted the measurement of uncertainty proposed in [14]
which is quantified via the relative bin width metric. Accuracy refers        for binary classification models:
to the ability of the view to accurately capture the characteristics (i.e.,
                                                                                                      u (lc ) (x) = 1 − p(ŷ|x)                   (6)
distribution) of the analyzed data, which is measured in terms of
Sum Squared Error (SSE). The p-value is a statistical term defined as         where u (lc ) (x) is the uncertainty score with a least confidence mea-
“the probability of obtaining a result equal to or more extreme than          surement of x; ŷ means the predicted class label of the unlabeled
what is actually observed, with the given null hypothesis being true”         x. Accordingly, after measuring the uncertainty of each unlabeled
[13]. In the problem of view recommendation, the null hypothesis              sample, the unlabeled sample with highest uncertainty is selected:
refers to the reference view, and the extremeness of the results refers
                                                                                                        x∗ = argmaxxu (x)                         (7)
to the interestingness of the target views.
    It should be noted that, in general, users may customize the util-        where u (x) can be any other measurement of informativeness over
ity features, including adding new ones, for personalized analysis.           the unlabeled sample x.
The current set of utility features mentioned above are selected to              Since estimating the uncertainty of a given view requires a proba-
illustrate the effectiveness of ViewSeeker.                                   bilistic based machine learning model, the view utility estimator (i.e.,
                                                                              non-probabilistic linear regression model) cannot be used to obtain
3.2    Interactive View Recommendation                                        the uncertainty score. To overcome this challenge, we employed
In the second phase, ViewSeeker interactively presents views to the           a separate Logistic Regression model trained on the same set of
user and then requests user feedback on each presented view—we                labeled views as the view utility estimator to serve as the uncertainty
denote such feedback as a label. In each iteration, ViewSeeker would          estimator, which estimates the uncertainty of each view.
present M example views (default M = 1) selected from all possible               Machine learning models such as the uncertainty estimator must
views to the user, and the user is expected to express their interest         be trained with both positive (i.e., interesting) and negative (i.e.,
with respect to each example view with a numeric label that ranges            not interesting) views. For this reason, ViewSeeker splits its second
between 0.0 - 1.0 with 1.0 being the most interesting—Example                 phase into two stages as well, such that the first stage addresses the
values would be 0.0 (not interesting), 0.7, 0.9, 1.0.                         “cold start" issue of the system and the second stage quickly refines
   After each iteration, ViewSeeker would use all collected feedback          the view utility estimator to discover the set of k most interesting
to train a Linear Regression model as the view utility estimator that         views.
predicts for each view the corresponding score produced by the ideal             The “cold start" issue is basically the acquisition of the first
utility function u ∗ (). We choose Linear Regression as the view utility      positive view. To facilitate this process, ViewSeeker would first
estimator because the task for predicting the utility score of a view         select views ranked highest according to each utility feature (e.g.,
can naturally be seen as a regression task.                                   deviation, accuracy, usability, p-value). Each utility feature would
   It can be easily noticed that the effectiveness of our approach            then be considered in a sequential manner, such that in each iteration
depends heavily on the example views being presented to the user              the set of M views ranked highest according to the current utility
for labeling. However, to measure exactly the benefit of each user            feature under consideration would be chosen from all possible views
label before obtaining it is difficult, especially given the fact that        and then presented to the user for their feedback (i.e., labels). In
the delay between each subsequent iteration cannot exceed the time            the case where no positive or negative feedback has been received
constraint tl. Fortunately, our problem of finding the most beneficial        after visiting all dimensions, ViewSeeker will then switch to random
views to be labeled aligned with the objective of Active Learning             sampling for the subsequent interactions with the user.
techniques [22].                                                                 In the second stage of the second phase, ViewSeeker uses the
   Active Learning is an interactive machine learning framework               uncertainty estimator to choose the most informative views to be
that achieves accurate classification with minimum human super-               presented to the user for feedback in each iteration. Specifically,
vision. A typical active learning approach would employ a query               when the output likelihood p(ŷ|x) of the uncertainty estimator of
strategy to sequentially select which unlabeled example (i.e., object)        a view is closest to 50%, this view is considered to be the most
in the database should be presented to the user next for labeling. A          uncertain, and thus should be preferred to views that are less un-
query strategy attempts to minimize the labeling costs by selecting           certain. During each iteration, ViewSeeker would sample as many
the most informative examples for building the classification model.          distinct views as possible within the given time constraint tl and
ViewSeeker: An Interactive View Recommendation Tool                                                           BigVis, 2019, Lisbon, Portugal

                   Table 1: Testbed Parameters                                         Table 2: Simulated Ideal Utility Functions
 Total number of records                          10 × 105 (DIAB)                    #     Involved utility features and weights
                                                   10 × 106 (SYN)                    1     1.0 * KL
 Cardinality ratio of records in D Q                     0.5 %                       2     1.0 * EMD
 Number of dimension attributes (A)              7 (DIAB), 5 (SYN)                   3     1.0 * MAX_DIFF
 Number of distinct values in A            variable (DIAB), 3 and 4 (SYN)            4     0.5 * EMD + 0.5 * KL
 Number of measure attributes (M)                8 (DIAB), 5 (SYN)                   5     0.5 * EMD + 0.5 * L2
 Number of aggregation functions                           5                         6     0.5 * EMD + 0.5 * p-value
 Number of view utility feature                            8                         7     0.3 * EMD + 0.3 * KL + 0.4 * MAX_DIFF
 Utility estimator                                 Linear regressor                  8     0.3 * EMD + 0.3 * L2 + 0.4 * MAX_DIFF
 Number of views presented per iteration                   1                         9     0.3 * EMD + 0.3 * p-value + 0.4 * MAX_DIFF
 Performance measurement                           Top-k precision                   10    0.3 * EMD + 0.3 * KL + 0.4 * Usability
 The number of views to recommend (k)            5, 10, 15, 20, 25, 30               11    0.3 * EMD + 0.3 * KL + 0.4 * Accuracy
 Optimization partial data ratio α                        10%
 Optimization time limit per iteration                 1 second
 Optimization performance measurement       Top-k precision, Running time   ViewSeeker to hide the necessary computation, and in turn, makes
                                                                            the delays transparent to the user.
                                                                               To utilize spare computing power efficiently, it is important to
then present M views with the highest uncertainty scores to the user        determine the order of the incremental view computation. To do so,
for feedback. After each iteration, the newly acquired label(s) (i.e.,      ViewSeeker uses the current view utility estimator to rank the views,
feedback) would be merged with all previously obtained labels for           and the views ranked highly would have higher priority in computing
use as the training data for the refinement of both the view utility        the accurate utility features. Effectively, these optimizations allow
estimator and uncertainty estimator.                                        ViewSeeker to reduce the unnecessary computation by pruning out
   After each refinement, the system will show the user the set of          the calculations for views that are less promising.
top-k views ranked highest according to the utility score produced
by the most recent view utility estimator. Once the user is satisfied
                                                                            4   EXPERIMENTAL TESTBED
with the set of recommended views, the entire exploration process
terminates, and the most recently trained view utility estimator u p ()     We evaluated the benefit of using ViewSeeker through a simulated
is selected as the estimation of u ∗ ().                                    user study, and measured both efficiency and effectiveness. Here we
                                                                            present the details of our testbed (Table 1).
3.3    Optimizations                                                        Setup We built a ViewSeeker platform in Python. Our experiments
In this section, we introduce the key optimizations built in ViewSeeker     were performed on a Core i5 server with 8GB of RAM.
to speed up the interactive recommendation process while minimiz-           Data Sets Two datasets were used in the experiments: DIAB & SYN.
ing the reduction in the effectiveness of the recommendation.                  DIAB is a categorical dataset of diabetic patients [1]. It contains
   Recall that in the first phase, ViewSeeker converts each basic           100 thousand records. We removed the attributes that have a large
utility component into utility features and embeds them into the            amount of missing data or are very sparse. After preprocessing, the
vector representation of the views. During this conversion, each            data set had 7 dimension attributes A, and 8 measure attributes M,
utility feature needs to be calculated for all possible views, which is     and a total of 280 distinct views were generated.
time-consuming. As the majority of the views are typically not of the          SYN is a synthetic dataset with 1 million numerical records that
user’s interest, it is unnecessary to fully compute these view utilities    contains 5 dimension attributes, 5 measure attributes, and 2 bin
with the entire database. Instead, we propose an optimization that          configurations (i.e., we create views with 3 bins or 4 bins). The
aims to compute only the set of utilities for views that are more           values of the attributes of each record are uniformly distributed. The
likely to be of interest to the user.                                       total view space for the SYN dataset is 250 distinct views.
   Specifically, our optimization works as follows. During the first        Simulated User Study Initially, we generated the view utility fea-
phase, we compute the set of utility features for each possible view        tures in two-step: 1) we created a hypercube in the recording space
only with α percent of the data, where α is a pre-defined ratio, which      to represent D Q , which is a subset of data specified by a query; and
can be tuned based on data size, available system resources, and            2) we generated view utility features based on the different utility
time constraint tl. In particular, ViewSeeker will uniformly sample         functions mentioned in Section 3. Afterward, we simulated different
α percent of data from the underlying database to produce a "rough"         user utility functions u ∗ (), each of which is a weighted sum of one
utility score for each utility feature of each view.                        or more individual utility functions. For each presented view vi , we
   Later on, during the second phase, ViewSeeker will incrementally         simulated the user’s belief with respect to the interestingness of a
refine the utility score of each view with the entire set of data when-     view through the normalized utility score produced by the u ∗ (vi ),
ever there is spare computing power available between user labeling         such that u ∗ (vi ) = 0.7 indicates the interestingness of view vi is
prompts while ensuring the time constraint tl is obeyed. In other           about 70% of the maximum.
words, the main idea is to quickly compute a set of initial "rough"         Performance We evaluated the performance of our system in the
values for the utility features of each view to enable subsequent in-       aspect of recommendation precision. In particular, we showed the
teractions and then incrementally compute the accurate utility scores       effectiveness of our proposed approach by measuring the number
for each view while leveraging the user interaction time. This allows       of example views needed to reach 100% precision. Here we define
BigVis, 2019, Lisbon, Portugal                           Xiaozhong Zhang, Xiaoyu Ge, Panos K. Chrysanthis, and Mohamed A. Sharaf


                   (a) Single Component u∗ ()                 (b) Two Components u∗ ()                (c) Three Components u∗ ()

                    Figure 3: Recommendation precision for DIAB dataset with different ideal utility functions u∗ ().


                   (a) Single Component u∗ ()                 (b) Two Components u∗ ()                (c) Three Components u∗ ()

                     Figure 4: Recommendation precision for SYN dataset with different ideal utility functions u∗ ().


the precision as the size of the intersection between the top-k views        is based), and the y-axis is the number of example views presented,
recommended by the ViewSeeker and the top-k views recommended                capturing the user effort required for the precision to reach 100%.
by the u ∗ (). For two sets of top-k views V p and V ∗ produced by               Specifically, in Figures 3a and 4a, we evaluated the effectiveness
ViewSeeker and u ∗ (), respectively, the precision is calculated as:         of ViewSeeker with respect to the ideal utility functions that contain
 |V p ∩ V ∗ |                                                                only a single utility component (i.e., results are averaged over ideal
              .
      k                                                                      Utility Functions 1-3 in Table 2). In Figures 3b and 4b, we repeated
Simulated Ideal Utility Functions We evaluated the effectiveness             the evaluation for composited ideal utility functions that consists of
and efficiency using 11 diverse ideal utility functions u ∗ () that in-      two utility components (i.e., results are averaged over ideal Utility
cluded 3 single component utility functions and 8 multi-component,           Functions 4-6). In Figures 3c and 4c, we repeated the evaluation for
composite utility functions (Table 2). We chose the components in            composite ideal utility functions with three utility components (i.e.,
multi-component u ∗ () carefully such that they represent different          results are averaged over ideal Utility Functions 7-11).
characteristics of the view. For example, EMD focuses on devia-                  From these results, we can observe that our proposed ViewSeeker
tions across bins, KL-divergence indicates the sum of deviation in           is extremely effective in discovering the set of ideal top-k views:
individual bins, Usability represents the visual quality of a view, etc.     for k ranging from 5-30, on average only 7-16 labels were required
                                                                             before the ViewSeeker reached a precision of 100% for both DIAB
                                                                             and SYN datasets. Clearly, this indicates that only a small amount
5     EXPERIMENTAL RESULTS                                                   of user effort is needed before a satisfactory set of results can be
We evaluated both the effectiveness of ViewSeeker and its optimiza-          obtained by the ViewSeeker.
tions by contacting three experiments.                                           Experiment 2: We compared the top-k recommended views by
                                                                             ViewSeeker with the top-k recommended views by the baselines in
5.1    Evaluation of Effectiveness                                           terms of the maximum achievable recommendation precision. We
We contacted two experiments to evaluate the effectiveness of the            use the 8 individual utility features (e.g., KL, EMD, L1, L2, etc.) as
ViewSeeker. The first measured the user effort, whereas the second           the baselines. Figure 5 shows the result for ideal Utility Function 11
measured the precision of the predicted utility function.                    (i.e., u ∗ () = 0.3 ∗ EMD + 0.3 ∗ KL + 0.4 ∗ Accuracy) in the DIAB
   Experiment 1: Figures 3 and 4 illustrate the effectiveness of the         dataset. We observe that the ViewSeeker achieved a 3X improvement
ViewSeeker by showing the number of example views that need                  against the best-performing baseline (i.e., EMD) in recommendation
to be labeled in order for the view utility estimator to reach 100%          precision. This indicates that using single fixed utility features will
precision in the top-k recommended views. Here, the x-axis is the k          not be able to capture a complex ideal utility function, which best
in top-k (i.e., the number of views on which the precision calculation       captures the user’s interest, such as the one above.
ViewSeeker: An Interactive View Recommendation Tool                                                                  BigVis, 2019, Lisbon, Portugal


                                                                                  to train a general machine learning model that predicts the utility
                                                                                  score of any given view [16]. The generic priors were obtained by
                                                                                  hiring a large number (i.e., 100) of human annotators to annotate
                                                                                  multiple real-world datasets. The key difference between our work
                                                                                  and all prior work is that all previous works use predefined view
                                                                                  utility functions and do not discover the utility function that best
                                                                                  matches an individual user’s intention and exploration task.
                                                                                     Interactive Visualization Tools have been studied extensively
                                                                                  for the past few years [3, 7, 9, 11, 12, 15, 17, 25]. Unlike visual-
                                                                                  ization recommendation tools such as ViewSeeker that recommend
                                                                                  visualization automatically by searching through the entire views
                                                                                  spaces, traditional interactive visualization tools require the user to
                                                                                  manually specify the views to be generated. Recently, a few interac-
Figure 5: Precision comparison with individual utility features                   tive visualization tools have attempted to automate part of the data
                                                                                  analysis and visualization process. For instance, Profiler automati-
                                                                                  cally helps analysts detect anomalies in the data [11]. But, unlike our
5.2    Evaluation of Optimizations                                                approach, Profiler is not capable of providing general recommenda-
We evaluated the effectiveness of our optimization techniques by                  tions for any group-by queries. Another recent example is VizDeck
comparing the recommendation precision and runtime between the                    [12], which generates all possible 2D views for a given dataset and
optimizations-enabled ViewSeeker and the optimizations-disabled                   allows the user to manually control (e.g., reordering, pinning) these
ViewSeeker (i.e., baseline model).                                                views through a dashboard, rather than using a utility function. The
   When comparing the recommendation precision, in order to elim-                 work [3] is the closest to our work in the sense that it also uses user
inate the non-determinism in selecting the k t h view in top-k we                 feedback to steer the view exploration. However, the user feedback
introduced the concept of utility distance (UD), which is defined as:             is only used to train a classifier, which is not capable to estimate the
                       X                          X                             ideal utility function and get the top-k views.
               UD =               u ∗ (v i ) −              u ∗ (v i ) /k   (8)
                       v i ∈V ∗                  v i ∈V p
                                                                                     Data Exploration techniques that aim to efficiently extract knowl-
                                                                                  edge from data [20] are complementary to our work. In particular,
where V ∗ are top-k views recommended by the ideal utility function               example-driven data exploration approaches [4, 8] assume mini-
u ∗ (), and V p are top-k views recommended by ViewSeeker.                        mum prior knowledge of the data and share the same underlying
    To explain the top-k non-determinism that motivated U D, note                 approach as ViewSeeker. These works aim to iteratively construct
that views directly after the k t h view may have very close, or even             the exploratory query through user interactions as ViewSeeker itera-
identical, utility as the k t h view. In such cases, changing the order           tively discovers the utility function using user feedback. ViewSeeker
among these close views should not affect the precision too much, if              is well suited to such situations and can enhance example-driven
any. U D has this desired property because its evaluation focuses on              data exploration by creating visualizations that illustrate interesting
utility distance instead of the exact inclusion of the top-k views.               pattern during the construction of the exploratory queries.
    Figures 6 and 7 show the number of feedback and runtime, respec-
tively, needed for both models to reach U D = 0 for the DIAB dataset.             7   CONCLUSION
On average, the model with optimization achieved 43% reduction in                 In this work, we addressed the challenge in visual analytics tools
running time while requiring only 19% more user labeling effort.                  of recommending the set of views most preferred by the user dur-
    The experiment result confirmed that our optimization methods                 ing an exploration task. Our contribution, ViewSeeker, is a novel,
are effective in reducing computing time by pruning out calculat-                 interactive view recommendation tool that offers a solution to the
ing for views that are less promising. The increase in the required               fundamental challenges of: 1) providing effective results with mini-
labeling effort is because the view utility features computed on par-             mum user effort, 2) enabling efficient and scalable computation, and
tial data is only an estimation and may be discrepant from the true               3) requiring no special expertise from the users.
features, which hinders the learning of the ideal utility function.                  The crux of ViewSeeker is the discovery of the utility function,
                                                                                  which is used to select the views that best match the user’s explo-
6     RELATED WORKS                                                               ration task. ViewSeeker employs an active learning-based predictive
In this section, we review works that are strongly relevant to ours.              model and effectively learns the user’s preferred views through sim-
   View Recommendation techniques automatically generate all                      ple interactions between the user and the system. To enable fluent
possible views of data, and recommend the top-k interesting views,                user interactions, we proposed a set of optimization techniques that
according to some utility function (e.g., [5, 10, 16, 18, 19, 21, 27–             significantly improved the runtime efficiency of the ViewSeeker. Our
29]). A key difference among these works is the proposed utility                  experimental results, using both synthetic and real-world datasets,
functions. Recent work placed addition constraints, e.g., upper bound             showed that our proposed ViewSeeker outperforms the alternative
on the number of views to be explored and execution time limit when               baseline approaches by a significant margin.
computing the recommended views [10]. The most close work to                         In the future, we plan to conduct a user study to validate the
ours is the use of generic priors (i.e., general knowledge of how                 recommendation effectiveness of ViewSeeker, and to extend it to
people associate views with different datasets and exploration tasks)             support more visualization types, such as scatter plot, line chart etc.
BigVis, 2019, Lisbon, Portugal                                        Xiaozhong Zhang, Xiaoyu Ge, Panos K. Chrysanthis, and Mohamed A. Sharaf


             (a) UD = 0 for Single Component u∗ ()                  (b) UD = 0 for Two Components u∗ ()            (c) UD = 0 for Three Components u∗ ()

            Figure 6: Recommendation Precision with optimization for DIAB dataset with different ideal utility functions u∗ ().


                (a) RT for Single Component u∗ ()                     (b) RT for Two Components u∗ ()                 (c) RT for Three Components u∗ ()

                Figure 7: System Runtime (RT) with optimization for DIAB dataset with different ideal utility functions u∗ ().


Acknowledgments We would like to thank the anonymous review-                              [14] David D. Lewis and William A. Gale. 1994. A sequential algorithm for training
ers for their helpful comments. This work was supported, in part, by                           text classifiers. In ACM SIGIR.
                                                                                          [15] M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande,
NIH under award U01HL137159. This paper does not represent the                                 J. Myllymaki, and K. Wenger. 1997. DEVise: Integrated Querying and Visual
views of NIH.                                                                                  Exploration of Large Datasets. In ACM SIGMOD.
                                                                                          [16] Yuyu Luo, Xuedi Qin, Nan Tang, and Guoliang Li. 2018. DeepEye: Towards
                                                                                               Automatic Data Visualization. In IEEE ICDE.
REFERENCES                                                                                [17] J. Mackinlay, P. Hanrahan, and C. Stolte. 2007. Show Me: Automatic Presentation
 [1] [n. d.]. https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. ([n. d.]).        for Visual Analysis. IEEE TVCG 13, 6 (2007), 1137–1144.
 [2] [n. d.]. Tableau public. http://public.tableau.com. ([n. d.]).                       [18] Rischan Mafrur, Mohamed A. Sharaf, and Hina A. Khan. 2018. DiVE: Diversify-
 [3] Michael Behrisch, Fatih Korkmaz, Lin Shao, and Tobias Schreck. 2014. Feedback-            ing View Recommendation for Visual Data Exploration. In ACM CIKM.
     driven interactive exploration of large multidimensional data supported by visual    [19] Dominik Moritz, Chenglong Wang, Greg L Nelson, Halden Lin, Adam M Smith,
     classifier. In IEEE VAST.                                                                 Bill Howe, and Jeffrey Heer. 2019. Formalizing visualization design knowledge as
 [4] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. 2014. Explore-by-               constraints: actionable and extensible models in Draco. IEEE TVCG 25, 1 (2019),
     example: an automatic query steering framework for interactive data exploration..         438–448.
     In ACM SIGMOD.                                                                       [20] Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, and Themis Palpanas. 2017.
 [5] Humaira Ehsan, Mohamed A. Sharaf, and Panos K. Chrysanthis. 2016. MuVE:                   New Trends on Exploratory Methods for Data Analytics. In VLDB. 1977–1981.
     Efficient Multi-Objective View Recommendation for Visual Data Exploration. In        [21] Belgin Mutlu, Eduardo Veas, and Christoph Trattner. 2016. Vizrec: Recommend-
     IEEE ICDE.                                                                                ing personalized visualizations. ACM TiiS 6, 4 (2016), 31.
 [6] H. Ehsan, M. A. Sharaf, and P. K. Chrysanthis. 2018. Efficient Recommendation        [22] Burr Settles. 2009. Active learning literature survey. TR. U. Wisconsin-Madison.
     of Aggregate Data Visualizations. IEEE TKDE 30, 2 (2018), 263–277.                   [23] Burr Settles and Mark Craven. 2008. An Analysis of Active Learning Strategies
 [7] D. Fisher. 2007. Hotmap: Looking at Geographic Attention. IEEE TVCG 13, 6                 for Sequence Labeling Tasks. In ACL EMNLP.
     (2007), 1184–1191.                                                                   [24] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by Committee. In ACM
 [8] Xiaoyu Ge, Yanbing Xue, Zhipeng Luo, Mohamed A. Sharaf, and Panos K.                      COLT.
     Chrysanthis. 2016. REQUEST: A Scalable Framework for Interactive Construction        [25] Chris Stolte and Pat Hanrahan. 2000. Polaris: A System for Query, Analysis and
     of Exploratory Queries. In IEEE International Conference on Big Data.                     Visualization of Multi-Dimensional Relational Databases. In IEEE INFOVIS.
 [9] Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, Anno Langen, Jayant            [26] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. 2017. Extract-
     Madhavan, Rebecca Shapley, Warren Shen, and Jonathan Goldberg-Kidon. 2010.                ing Top-K Insights from Multi-dimensional Data. In ACM SIGMOD.
     Google fusion tables: web-centered data management and collaboration. In ACM         [27] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya G. Parameswaran, and
     SIGMOD.                                                                                   Neoklis Polyzotis. 2015. SEEDB: Efficient Data-Driven Visualization Recom-
[10] Ibrahim A. Ibrahim, Abdullah M. Albarrak, Xue Li. 2017. Constrained Recom-                mendations to Support Visual Analytics. PVLDB 8, 13 (2015), 2182–2193.
     mendations for Query Visualizations. Knowl. Inf. Syst. 51, 2 (2017), 499–529.        [28] Kanit Wongsuphasawat, Dominik Moritz, Anushka Anand, Jock Mackinlay, Bill
[11] Sean Kandel, Ravi Parikh, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey             Howe, and Jeffrey Heer. 2016. Voyager: Exploratory analysis via faceted browsing
     Heer. 2012. Profiler: integrated statistical analysis and visualization for data          of visualization recommendations. IEEE TVCG 1 (2016).
     quality assessment. In ACM AVI.                                                      [29] Kanit Wongsuphasawat, Zening Qu, Dominik Moritz, Riley Chang, Felix Ouk,
[12] Alicia Key, Bill Howe, Daniel Perry, and Cecilia R. Aragon. 2012. VizDeck:                Anushka Anand, Jock Mackinlay, Bill Howe, and Jeffrey Heer. 2017. Voyager
     self-organizing dashboards for visual analytics. In ACM SIGMOD.                           2: Augmenting visual analysis with partial view specifications. In ACM CHI.
[13] Martin Krzywinski and Naomi Altman. 2013. Significance, P values and t-tests.             2648–2659.
     In Nature methods.