                          Facilitating User Interaction With Data

                                                              Zainab Zolaktaf
                                                     Supervised by Rachel Pottinger
                                                      University of British Columbia
                                                        Vancouver, B.C., Canada
                                                      {zolaktaf, rap}@cs.ubc.ca



ABSTRACT
In many domains, such as scientific computing, users can directly access and query data that is stored in large, and often structured, data sources. Discovering interesting patterns and efficiently locating relevant information, however, can be challenging. Users must be aware of the data content and its structure before they can query it. Furthermore, they have to interpret the retrieved results and possibly refine their query. Essentially, to find information, the user has to engage in a repeated cycle of data exploration, query composition, and query answer analysis. The focus of my PhD research is on designing techniques that facilitate this interaction. Specifically, I examine the utility of recommender systems for the data exploration and query composition phases, and propose techniques that assist users in the query answer analysis phase. Overall, the solutions developed in my thesis aim to increase the efficiency and decision quality of users.

1.   INTRODUCTION
   With the advent of technology and the web, large volumes of data are generated and stored in data sources that evolve and grow over time. Often, these sources are structured as relational databases that users can directly query and explore. For instance, astronomical measurements are stored in a large relational database, called the Sloan Digital Sky Survey (SDSS) [19, 21]. Climate data collected from various sources is integrated in relational databases and offered for analysis by users [5].
   At a high level, user interaction with data involves two phases: a query composition phase, where the user composes and submits a query, and a query answer analysis phase, where the user analyses query answers produced by the system. During both phases, however, users can face problems in understanding the data.
   Consider, for example, the scientific computing domain. The SDSS schema has over 88 tables, 51 views, 204 user-defined functions, and 3440 columns [14]. A variety of users, ranging from high school students to professional astronomers, with varied levels of skills and knowledge, interact with this database. Furthermore, scientific databases are typically used for Interactive Data Exploration (IDE), where users pose exploratory queries to understand the content and find patterns [13]. Efficiently composing queries over this data to discover interesting patterns is one of their main challenges.
   After successfully composing the query, the next step is to interpret query answers. However, the retrieved results can often be difficult to understand. For example, consider an aggregate query SELECT AVG(TEMPERATURE) over climate data. In the weather domain, observational data regarding atmospheric conditions is collected by several weather stations, satellites, and ships. For the same data point, e.g., temperature on a given day, there can be conflicting and duplicate values. Consequently, the aggregate query can have an overwhelming number of correct and conflicting answers. Here, mechanisms that aid the user in understanding the query answers are required.
   In my thesis, I develop techniques that assist user interaction with data. I consider the data exploration and query composition phase, and examine the utility of recommendation systems for this phase. Furthermore, I consider the query answer analysis phase and devise efficient techniques that provide insights about query answers. More precisely, I study three problems: 1. how do classical recommendation systems perform with regard to exploration tasks in standard recommendation domains, and how can we modify them to facilitate data exploration more rigorously (Section 2)? 2. what are the challenges of recommendation in the relational database context, and which algorithms are appropriate for helping users explore data and compose queries (Section 3)? 3. how can we assist users in the query answer analysis phase (Section 4)? Overall, I aim to develop techniques that help users explore data and increase their decision quality.

© Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017, Munich, Germany. Copyright (C) 2017 for this paper by its authors. Copying permitted for private and academic purposes.

2.   FACILITATING DATA EXPLORATION WITH RECOMMENDER SYSTEMS
   One way to facilitate data navigation and exploration is to find and suggest items of interest to users by deploying a recommendation system [6, 16, 29]. Classical recommendation systems are categorized into content-based and collaborative filtering methods.
   Content-based methods use descriptive features, such as genre of movies or user demographics, to construct informative user and item profiles, and measure similarity between
them. But descriptive features might not be available. Collaborative filtering methods instead infer user interests from user interaction data. The main intuition is that users with similar interaction patterns have similar interests.
   The interaction data may include explicit user feedback on items, such as user ratings on movies, or implicit feedback, such as purchasing history, browsing and click logs, or query logs [11]. An important property of the interaction data is that the majority of items (users) receive (provide) little feedback and are infrequent, while a few receive (provide) lots of feedback and are frequent. But many models only work well when there is a lot of data available, i.e., they make good recommendations for frequent users, and are biased toward recommending frequent items [6, 15, 17].
   However, recommending popular items is not sufficient for exploratory tasks. Users are likely already aware of popular items or can find them on their own. Concentrating on popular items also means the system has low overall coverage of the item space in its recommendations. It is essential to develop methods that help users discover new items that may be less common but more interesting. Therefore, we investigate the following research question:

      How do existing recommendation models perform with regard to data exploration tasks in standard recommendation domains, and how can they be modified to facilitate data exploration more rigorously?

To answer this question, we focus on top-N item recommendation, where the goal is to recommend the most appealing set of N items to each user [6]. Informally, the problem setting is as follows: we are given a log of explicit user feedback, e.g., ratings, for different items. We want to assign a set of N unseen items to each user.

2.1    Solution
   In our solution [20], we focused on promoting less frequent items, or long-tail items, in top-N sets to facilitate exploration. Recommending these items introduces novelty and serendipity into top-N sets, and allows users to discover new items. It also increases the item-space coverage, which increases profits for providers of the items [3, 6, 22, 26]. Our main challenge was in promoting long-tail items in a targeted manner, and in designing responsive and scalable models. We used historical rating data to learn user preference for discovering new items. The main intuition was that the long-tail preference of user u, captured by θ*_u, depends on the types of long-tail items she rates. Moreover, the long-tail type or weight of item i, captured by w_i, depends on the long-tail preference of users who rate that item. Based on this, we formulated a joint optimization objective for learning both unknown variables, θ* and w.
   Next, we integrated the learned user preference estimates, θ*, into a generic re-ranking framework to provide a customized balance between accuracy and coverage. Specifically, we defined a re-ranking framework that required three components: 1. an accuracy recommender, responsible for recommending accurate top-N sets; 2. a coverage recommender, responsible for suggesting top-N sets that maximize coverage across the item space and consequently promote long-tail items; 3. the user long-tail preference.
   In contrast to prior related work [1, 10, 27], our framework learned the personalization rather than optimizing using cross-validation or parameter tuning; in other words, our personalization method was independent of the underlying recommendation model.
   We evaluated our framework on several standard datasets from the movie domain. Table 1 shows the top-5 recommendation performance for the MovieTweetings 200K (MT-200K) dataset [9], which contains voluntary movie rating tweets from users. For accuracy, we computed precision (P@5) and recall (R@5) [6] w.r.t. the test items of users. Long-tail accuracy (L@5) [10] is the normalized number of long-tail items in top-5 sets per user. Long-tail items are those that generate the lower 20% of the total ratings in the train set, based on the Pareto principle or the 80/20 rule [26]. Coverage (C@5) [10] is the ratio of the number of distinct items recommended to all users, to the number of items.

      Table 1: Top-5 recommendation performance on MT-200K.

      Algorithm            P@5     R@5     L@5     C@5
      Random               0.000   0.000   0.871   0.873
      Pop [6]              0.051   0.080   0.000   0.002
      MF [28]              0.000   0.000   1.000   0.001
      5D ACC [10]          0.000   0.000   0.995   0.157
      CofiR [24]           0.025   0.046   0.066   0.020
      PureSVD [6]          0.018   0.022   0.001   0.067
      θ*_Pop^Dyn900 [20]   0.027   0.050   0.416   0.171

   We compared with non-personalized baselines: Random, which has high coverage but low accuracy, and most popular recommendation (Pop) [6], which provides accurate top-N sets but has low coverage and long-tail accuracy. We also compared with personalized algorithms: matrix factorization (MF) with 40 factors, L2-regularization, and stochastic gradient descent optimization [28]; a resource allocation approach that re-ranks MF (5D ACC) [10]; CofiRank with regression loss (CofiR) [24]; and PureSVD with 300 factors [6]. On MT-200K, we chose the non-personalized Pop algorithm as our accuracy recommender, and combined it with a dynamic coverage recommender (Dyn900) introduced in [20]. Our personalized algorithm is denoted θ*_Pop^Dyn900. Table 1 shows that while most baselines achieve their best performance in either coverage or accuracy metrics, θ*_Pop^Dyn900 has high coverage while maintaining reasonable accuracy levels. Furthermore, it outperforms the personalized algorithms, PureSVD and CofiR, in both accuracy and coverage metrics.

3.   FACILITATING DATA EXPLORATION AND QUERY COMPOSITION
   Getting information out of database systems is a major challenge [12]. Users must be familiar with the schema to be able to compose queries. Some relational database systems, e.g., SkyServer, provide a sample of example queries to aid users with this task. However, compared to the size of the database and the complexity of potential queries, this sample set is small and static. The problem is exacerbated as the volume of data increases, particularly for IDE. A mechanism that helps users navigate the schema and data space, and exposes relevant data regions based on their query context, is required. We consider using recommendation systems in this setting and focus on the following research question:

      What are the challenges of recommendation in the database context, and which algorithms are suitable for facilitating interactive exploration and navigation of relational databases?
To answer this question, we address top-N aspect recommendation, where the goal is to suggest a set of N aspects to the user that facilitate query composition and database exploration. Similar to the collaborative filtering setting in Section 2, we analyse user interaction data, available in a query log. Informally, the problem setting is as follows: we are given a query log that is partitioned into sessions, sets of queries submitted by the same user. Furthermore, we also have a relational database synopsis with information about the schema of the database (#relations, #attributes, and foreign key constraints) and the range of numerical attributes. Given a new partial session, the objective is to recommend potential query extensions, or aspects.

3.1    Proposed Work
   To formulate an adequate solution, the following challenges must be addressed:

   1. Aspect Definition. There is no clear notion of “item” or aspect in this setting. Instead, we need to find an adequate set of aspects that can be used to capture user intent and characterize queries. Given the exploratory nature of queries in the scientific domains, the aspects should enable both schema navigation and data space exploration.

   2. Sequential Aspects and Domain-Specific Constraints. Individual elements in a SQL query are sequential and there are dependencies between them. For instance, in SELECT T.A FROM T WHERE X > 10, the domain of variable X is the attributes of table T. Thus, given a partial query, only a subset of the aspects are syntactically valid. Queries in the same session are also submitted sequentially.

   3. Session and Aspect Sparsity. In SDSS, the typical session has six SQL queries and lasts thirty minutes [21], which indicates aspect sparsity in queries and sessions.

The relational database setting exhibits some similarities to standard recommendation domains (e.g., movies): some aspects, e.g., tables, attributes, and data regions, are popular, while the majority of them are unpopular. Some sessions are frequent, i.e., many queries are submitted, while the majority are infrequent. Scalability and responsiveness are important in both domains.
   Analogous to our work in Section 2, our main hypothesis is that merely recommending popular aspects is not sufficient for exploratory tasks. Although popular aspects can help familiarize novice users with concepts like the important tables and attributes, given the exploratory nature of queries in IDE, recommendations are deemed more useful if they can help users narrow down their queries and expose relevant data regions. For example, recommending a specific interval like b1 < BRIGHTNESS < b2 is more useful than just suggesting the attribute BRIGHTNESS.
   Based on these intuitions, we will focus on recommending interesting aspects that enable data exploration and schema navigation for users of a relational database, in particular in IDE settings. Using the query log and the database synopsis, we will devise a set of aspects that include not just the relations, attributes, and user-defined functions, but also intervals of numeric attributes, e.g., b1 < BRIGHTNESS < b2. Subsequently, we can use a vector-based query representation model where each element denotes the presence of a certain aspect. Alternatively, a graph-based representation [23] might be more suitable. After formulating similarity measures between queries (or sessions) [2], we can use a nearest neighbour model to suggest relevant aspects to the user.
   In contrast to prior work that focuses on supervised learning and query rewriting [7], we focus on aspect definition and extraction. In contrast to [4, 7, 8], we rely on the database synopsis only: accessing a large scientific database like SDSS to retrieve the entire set of tuples is expensive. In contrast to [14], our recommendations include intervals, not just tables and attributes. The intermediate query format in [18] is complementary to our work.

4.   FACILITATING QUERY ANSWER ANALYSIS
   After users have successfully submitted a query, their next challenge is to analyse and understand the query answers. When the answer set is small, this task is attainable. The challenge is in examining and interpreting large, or even conflicting, answer sets.
   To illustrate the problem, consider again climate data that is reported by various sources and integrated in relational databases. Because the sources were independently created and maintained, a given data point can have multiple, inconsistent values across the sources. For example, one source may have the high temperature for Vancouver on 06/11/2006 as 17C, while another may list it as 19C. As a result of this value-level heterogeneity, an aggregate query such as SELECT AVG(TEMPERATURE) does not have a single true answer. Instead, depending on the choice of data source combinations that are used to answer the query, different answers can be generated. Reporting the entire set of answers can overwhelm the user. Here, mechanisms that summarize the results and help the user understand query answers are required. Therefore, we study the following research question:

      After a query has been submitted to the system, how can we help the user understand and interpret the query answers?

Specifically, we address the problem of helping users understand aggregate query answers in integration contexts where data is segmented across several sources. We assume meta-information that describes the mappings and bindings between data sources is available [25]. Our main concern is how to handle the value-level heterogeneity that exists in the data, to enable the user to better understand the range of possible query answers.

4.1    Solution
   In our solution [30], we represented the answer to the aggregate query as an answer distribution instead of a single scalar value. We then proposed a suite of methods for extracting statistics that convey meaningful information about the query answers. We focused on the following challenges: 1. determining which statistics best represent an answer’s distribution; 2. efficiently computing the desired statistics. In deriving our algorithms, we assumed prior knowledge regarding the sources is unavailable and all sources are equal.
   A high coverage interval is one of the statistics we extract to convey the shape of the answer distribution and
the intervals where the majority of viable answers can be found. Figure 1 shows the multi-modal answer distributions of the aggregate query AVG(TEMP) on Canadian climate data (S1) [5] and synthetic data (S4) [30], and their corresponding high coverage intervals.

Figure 1: High coverage intervals tell where the majority of answers can be found. (a) S1: 2 intervals cover 85.72% of the area, with length 1791.83 (22.72% of the range). (b) S4: 10 intervals cover 92.12% of the area, with length 5288.17 (55.53% of the range).

5.   SUMMARY AND OUTLOOK
   The goal of my thesis is to devise techniques that facilitate user interaction with data. I address three aspects:

   • (Accomplished) Facilitating data exploration with recommender systems in standard domains (Section 2).

   • (In progress) Facilitating data exploration and query composition in the relational database context (Section 3). I am currently working on extracting a dataset, and narrowing down the problem statement.

   • (Accomplished) Facilitating query answer analysis by extracting statistics and semantics about the range of query answers (Section 4).

6.   REFERENCES
 [1] Gediminas Adomavicius and YoungOk Kwon. Improving aggregate recommendation diversity using ranking-based techniques. TKDE, 24(5):896–911, 2012.
 [2] Julien Aligon, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, and Elisa Turricchia. Similarity measures for OLAP sessions. Knowledge and Information Systems, 39(2):463–489, 2014.
 [3] Pablo Castells, Neil J. Hurley, and Saul Vargas. Novelty and diversity in recommender systems. In Recommender Systems Handbook. Springer US, 2015.
 [4] Gloria Chatzopoulou, Magdalini Eirinaki, and Neoklis Polyzotis. Query recommendations for interactive database exploration. In International Conference on Scientific and Statistical Database Management, pages 3–18. Springer, 2009.
[10] Yu-Chieh Ho, Yi-Ting Chiang, and Jane Yung-Jen Hsu. Who likes it more? Mining worth-recommending items from long tails by modeling relative preference. In WSDM, pages 253–262, 2014.
[11] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263–272. IEEE, 2008.
[12] HV Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi, and Cong Yu. Making database systems usable. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 13–24. ACM, 2007.
[13] Martin L Kersten, Stratos Idreos, Stefan Manegold, Erietta Liarou, et al. The researcher’s guide to the data deluge: Querying a scientific database in just a few seconds. PVLDB Challenges and Visions, 3, 2011.
[14] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. SnipSuggest: Context-aware autocompletion for SQL. Proceedings of the VLDB Endowment, 4(1):22–33, 2010.
[15] Joonseok Lee, Samy Bengio, Seungyeon Kim, Guy Lebanon, and Yoram Singer. Local collaborative ranking. In WWW, pages 85–96, 2014.
[16] Joonseok Lee, Mingxuan Sun, and Guy Lebanon. A comparative study of collaborative filtering algorithms. arXiv preprint arXiv:1205.3193, 2012.
[17] Andriy Mnih and Ruslan Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2007.
[18] Hoang Vu Nguyen, Klemens Böhm, Florian Becker, Bertrand Goldman, Georg Hinkel, and Emmanuel Müller. Identifying user interests within the data space: A case study with SkyServer. In EDBT, pages 641–652, 2015.
[19] M Jordan Raddick, Ani R Thakar, Alexander S Szalay, and Rafael DC Santos. Ten years of SkyServer I: Tracking web and SQL e-science usage. Computing in Science & Engineering, 16(4):22–31, 2014.
[20] Information removed for double-blind review. Submitted paper, 2017.
[21] Vik Singh, Jim Gray, Ani Thakar, Alexander S Szalay, Jordan Raddick, Bill Boroski, Svetlana Lebedeva, and Brian Yanny. SkyServer traffic report: The first five years. arXiv preprint cs/0701173, 2007.
[22] Saúl Vargas and Pablo Castells. Improving sales diversity by recommending users to items. In RecSys, 2014.
[23] Roy Villafane, Kien A Hua, Duc Tran, and Basab Maulik. Mining interval time series. In International Conference on Data Warehousing and Knowledge Discovery, pages 318–330. Springer, 1999.
[24] Markus Weimer, Alexandros Karatzoglou, Quoc Viet Le, and Alex Smola. Maximum margin matrix factorization for collaborative ranking. Advances in Neural Information Processing Systems, pages 1–8, 2007.
[25] Jian Xu and Rachel Pottinger. Integrating domain heterogeneous data sources using decomposition
 [5] Climate Canada. Canada climate data. http://climate.                                                                                aggregation queries. Information Systems, 39(0), 2014.
     weatheroffice.gc.ca/climateData/canada_e.html, 2010.                                                                           [26] Hongzhi Yin, Bin Cui, Jing Li, Junjie Yao, and Chen Chen.
 [6] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin.                                                                                  Challenging the long tail recommendation. PVLDB,
     Performance of recommender algorithms on top-n                                                                                      5(9):896–907, 2012.
     recommendation tasks. In RecSys, 2010.                                                                                         [27] Weinan Zhang, Jun Wang, Bowei Chen, and Xiaoxue Zhao.
 [7] Julien Cumin, Jean-Marc Petit, Vasile-Marian Scuturici,                                                                             To personalize or not: a risk management perspective. In
     and Sabina Surdu. Data exploration with sql using machine                                                                           RecSys, pages 229–236, 2013.
     learning techniques. In EDBT, 2017.                                                                                            [28] Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and
 [8] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei                                                                                 Chih-Jen Lin. A fast parallel sgd for matrix factorization in
     Diao. Explore-by-example: An automatic query steering                                                                               shared memory systems. In RecSys, pages 249–256, 2013.
     framework for interactive data exploration. In Proceedings                                                                     [29] Sedigheh Zolaktaf and Gail C Murphy. What to learn next:
     of the 2014 ACM SIGMOD international conference on                                                                                  recommending commands in a feature-rich environment. In
     Management of data, pages 517–528. ACM, 2014.                                                                                       ICMLA, pages 1038–1044. IEEE, 2015.
 [9] Simon Dooms, Toon De Pessemier, and Luc Martens.                                                                               [30] Zainab Zolaktaf, Jian Xu, and Rachel Pottinger. Extracting
     Movietweetings: a movie rating dataset collected from                                                                               aggregate answer statistics for integration. EDBT, 2015.
     twitter. In CrowdRec at RecSys, 2013.