=Paper=
{{Paper
|id=Vol-1882/paper11
|storemode=property
|title=Facilitating User Interaction With Data
|pdfUrl=https://ceur-ws.org/Vol-1882/paper11.pdf
|volume=Vol-1882
|authors=Zainab Zolaktaf
|dblpUrl=https://dblp.org/rec/conf/vldb/Zolaktaf17
}}
==Facilitating User Interaction With Data==
Zainab Zolaktaf, supervised by Rachel Pottinger

University of British Columbia, Vancouver, B.C., Canada

{zolaktaf, rap}@cs.ubc.ca

Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich, Germany. Copyright (C) 2017 for this paper by its authors. Copying permitted for private and academic purposes.

===Abstract===
In many domains, such as scientific computing, users can directly access and query data that is stored in large, and often structured, data sources. Discovering interesting patterns and efficiently locating relevant information, however, can be challenging. Users must be aware of the data content and its structure before they can query it. Furthermore, they have to interpret the retrieved results and possibly refine their query. Essentially, to find information, the user has to engage in a repeated cycle of data exploration, query composition, and query answer analysis. The focus of my PhD research is on designing techniques that facilitate this interaction. Specifically, I examine the utility of recommender systems for the data exploration and query composition phases, and propose techniques that assist users in the query answer analysis phase. Overall, the solutions developed in my thesis aim to increase the efficiency and decision quality of users.

===1. Introduction===
With the advent of technology and the web, large volumes of data are generated and stored in data sources that evolve and grow over time. Often, these sources are structured as relational databases that users can directly query and explore. For instance, astronomical measurements are stored in a large relational database, called the Sloan Digital Sky Survey (SDSS) [19, 21]. Climate data collected from various sources is integrated in relational databases and offered for analysis by users [5].

At a high level, user interaction with data involves two phases: a query composition phase, where the user composes and submits a query, and a query answer analysis phase, where the user analyses query answers produced by the system. During both phases, however, users can face problems in understanding the data.

Consider, for example, the scientific computing domain. The SDSS schema has over 88 tables, 51 views, 204 user-defined functions, and 3440 columns [14]. A variety of users, ranging from high school students to professional astronomers, with varied levels of skills and knowledge, interact with this database. Furthermore, scientific databases are typically used for Interactive Data Exploration (IDE), where users pose exploratory queries to understand the content and find patterns [13]. Efficiently composing queries over this data to discover interesting patterns is one of their main challenges.

After successfully composing the query, the next step is to interpret query answers. However, the retrieved results can often be difficult to understand. For example, consider an aggregate query <code>SELECT AVG(TEMPERATURE)</code> over climate data. In the weather domain, observational data regarding atmospheric conditions is collected by several weather stations, satellites, and ships. For the same data point, e.g., temperature on a given day, there can be conflicting and duplicate values. Consequently, the aggregate query can have an overwhelming number of correct and conflicting answers. Here, mechanisms that aid the user in understanding the query answers are required.

In my thesis, I develop techniques that assist user interaction with data. I consider the data exploration and query composition phase, and examine the utility of recommendation systems for this phase. Furthermore, I consider the query answer analysis phase and devise efficient techniques that provide insights about query answers. More precisely, I study three problems:
# How do classical recommendation systems perform with regards to exploration tasks in standard recommendation domains, and how can we modify them to facilitate data exploration more rigorously (Section 2)?
# What are the challenges of recommendation in the relational database context and which algorithms are appropriate for helping users explore data and compose queries (Section 3)?
# How can we assist users in the query answer analysis phase (Section 4)?

Overall, I aim to develop techniques that help users explore data and increase their decision quality.

===2. Facilitating Data Exploration With Recommender Systems===
One way to facilitate data navigation and exploration is to find and suggest items of interest to users by deploying a recommendation system [6, 16, 29]. Classical recommendation systems are categorized into content-based and collaborative filtering methods.

Content-based methods use descriptive features such as genre of movies, or user demographics, to construct informative user and item profiles, and measure similarity between them. But descriptive features might not be available. Collaborative filtering methods instead infer user interests from user interaction data. The main intuition is that users with similar interaction patterns have similar interests.

The interaction data may include explicit user feedback on items, such as user ratings on movies, or implicit feedback, such as purchasing history, browsing and click logs, or query logs [11]. An important property of the interaction data is that the majority of items (users) receive (provide) little feedback and are infrequent, while a few receive (provide) lots of feedback and are frequent.

{| class="wikitable"
|+ Table 1: Top-5 recommendation performance.
! Dataset !! Algorithm !! P@5 !! R@5 !! L@5 !! C@5
|-
| rowspan="7" | MT-200K
| Random || 0.000 || 0.000 || 0.871 || 0.873
|-
| Pop [6] || 0.051 || 0.080 || 0.000 || 0.002
|-
| MF [28] || 0.000 || 0.000 || 1.000 || 0.001
|-
| 5D ACC [10] || 0.000 || 0.000 || 0.995 || 0.157
|-
| CofiR [24] || 0.025 || 0.046 || 0.066 || 0.020
|-
| PureSVD [6] || 0.018 || 0.022 || 0.001 || 0.067
|-
| <math>\theta^{*\,\mathrm{Dyn900}}_{\mathrm{Pop}}</math> [20] || 0.027 || 0.050 || 0.416 || 0.171
|}
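The column headers of Table 1 (P@5, R@5, L@5, C@5) are top-5 precision, recall, long-tail accuracy, and coverage. The following is a minimal sketch of how such metrics could be computed; it is not the evaluation code behind the reported numbers. The function names, the toy data, and the choice to average precision, recall, and long-tail accuracy over users are assumptions made for illustration. Long-tail items are taken here to be the items that generate the lower 20% of the ratings in the train set, L@N is read as the fraction of long-tail items in each user's top-N set, and C@N as the number of distinct recommended items divided by the number of items.

```python
from collections import Counter


def long_tail_items(train_ratings, head_share=0.8):
    """Return the long-tail items of a list of (user, item) training pairs.

    Assumption: the 'head' is the smallest set of most popular items that
    generates at least head_share of all ratings; the rest is the long tail.
    """
    counts = Counter(item for _, item in train_ratings)
    total = sum(counts.values())
    tail, running = set(), 0
    # Walk items from most to least popular; once the head already covers
    # head_share of all ratings, every remaining item is long-tail.
    for item, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= head_share * total:
            tail.add(item)
        running += count
    return tail


def evaluate_top_n(recs, test_items, tail, n_items, n=5):
    """Return (P@n, R@n, L@n, C@n) for recs: user -> ranked list of items."""
    precision = recall = long_tail = 0.0
    recommended = set()
    for user, ranked in recs.items():
        top = ranked[:n]
        relevant = test_items.get(user, set())
        hits = sum(1 for item in top if item in relevant)
        precision += hits / n
        recall += hits / len(relevant) if relevant else 0.0
        long_tail += sum(1 for item in top if item in tail) / n
        recommended.update(top)
    k = len(recs) or 1  # avoid division by zero when there are no users
    return precision / k, recall / k, long_tail / k, len(recommended) / n_items


# Hypothetical toy data: items "a" and "b" are popular, "c" and "d" are rare.
train = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"),
         ("u2", "b"), ("u3", "b"), ("u4", "b"), ("u5", "b"),
         ("u3", "c"), ("u4", "d")]
tail = long_tail_items(train)              # {"c", "d"}
recs = {"u1": ["b", "c"], "u5": ["a", "d"]}
test_items = {"u1": {"b"}, "u5": {"c", "d"}}
print(evaluate_top_n(recs, test_items, tail, n_items=4, n=2))
# prints (0.5, 0.75, 0.5, 1.0)
```

In this toy run, recommending only the two popular items to every user would lower both the long-tail and the coverage values, which is the trade-off the L@5 and C@5 columns are meant to expose.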
Many recommendation models only work well when there is a lot of data available, i.e., they make good recommendations for frequent users, and are biased toward recommending frequent items [6, 15, 17].

However, recommending popular items is not sufficient for exploratory tasks. Users are likely already aware of popular items or can find them on their own. Concentrating on popular items also means the system has low overall coverage of the item space in its recommendations. It is essential to develop methods that help users discover new items that may be less common but more interesting. Therefore, we investigate the following research question:

:''How do existing recommendation models perform with regard to data exploration tasks in standard recommendation domains, and how can they be modified to facilitate data exploration more rigorously?''

To answer this question, we focus on top-N item recommendation, where the goal is to recommend the most appealing set of N items to each user [6]. Informally, the problem setting is as follows: we are given a log of explicit user feedback, e.g., ratings, for different items. We want to assign a set of N unseen items to each user.

====2.1 Solution====
In our solution [20], we focused on promoting less frequent items, or long-tail items, in top-N sets to facilitate exploration. Recommending these items introduces novelty and serendipity into top-N sets, and allows users to discover new items. It also increases the item-space coverage, which increases profits for providers of the items [3, 6, 26, 22].

Our main challenge was in promoting long-tail items in a targeted manner, and in designing responsive and scalable models. We used historical rating data to learn user preference for discovering new items. The main intuition was that the long-tail preference of user <math>u</math>, captured by <math>\theta^*_u</math>, depends on the types of long-tail items she rates. Moreover, the long-tail type or weight of item <math>i</math>, captured by <math>w_i</math>, depends on the long-tail preference of users who rate that item. Based on this, we formulated a joint optimization objective for learning both unknown variables, <math>\theta^*</math> and <math>w</math>.

Next, we integrated the learned user preference estimates, <math>\theta^*</math>, into a generic re-ranking framework to provide a customized balance between accuracy and coverage. Specifically, we defined a re-ranking framework that required three components:
# an accuracy recommender that was responsible for recommending accurate top-N sets;
# a coverage recommender that was responsible for suggesting top-N sets that maximized coverage across the item space, and consequently promoted long-tail items;
# the user long-tail preference.

In contrast to prior related work [1, 10, 27], our framework learned the personalization rather than optimizing using cross-validation or parameter tuning; in other words, our personalization method was independent of the underlying recommendation model.

We evaluated our framework on several standard datasets from the movie domain. Table 1 shows the top-5 recommendation performance for the MovieTweetings 200K (MT-200K) dataset [9], which contains voluntary movie rating tweets from users. For accuracy, we computed precision (P@5) and recall (R@5) [6] with respect to the test items of users. Long-tail accuracy (L@5) [10] is the normalized number of long-tail items in top-5 sets per user. Long-tail items are those that generate the lower 20% of the total ratings in the train set, based on the Pareto principle or the 80/20 rule [26]. Coverage (C@5) [10] is the ratio of the number of distinct items recommended to all users, to the number of items.

We compared with non-personalized baselines: Random, which has high coverage but low accuracy, and most popular recommendation (Pop) [6], which provides accurate top-N sets but has low coverage and long-tail accuracy. We also compared with personalized algorithms: matrix factorization (MF) with 40 factors, L2-regularization, and stochastic gradient descent optimization [28], a resource allocation approach that re-ranks MF (5D ACC) [10], CofiRank with regression loss (CofiR) [24], and PureSVD with 300 factors [6]. On MT-200K, we chose the non-personalized Pop algorithm as our accuracy recommender, and combined it with a dynamic coverage recommender (Dyn900) introduced in [20]. Our personalized algorithm is denoted <math>\theta^{*\,\mathrm{Dyn900}}_{\mathrm{Pop}}</math>. Table 1 shows that while most baselines achieve best performance in either coverage or accuracy metrics, <math>\theta^{*\,\mathrm{Dyn900}}_{\mathrm{Pop}}</math> has high coverage while maintaining reasonable accuracy levels. Furthermore, it outperforms the personalized algorithms, PureSVD and CofiR, in both accuracy and coverage metrics.

===3. Facilitating Data Exploration and Query Composition===
Getting information out of database systems is a major challenge [12]. Users must be familiar with the schema to be able to compose queries. Some relational database systems, e.g., SkyServer, provide a sample of example queries to aid users with this task. However, compared to the size of the database and complexity of potential queries, this sample set is small and static. The problem is exacerbated as the volume of data increases, particularly for IDE. A mechanism that helps users navigate the schema and data space, and exposes relevant data regions based on their query context, is required. We consider using recommendation systems in this setting and focus on the following research question:

:''What are the challenges of recommendation in the database context, and which algorithms are suitable for facilitating interactive exploration and navigation of relational databases?''
To answer this question, we address top-N aspect recommendation, where the goal is to suggest a set of N aspects to the user that facilitate query composition and database exploration. Similar to the collaborative filtering setting in Section 2, we analyse user interaction data, available in a query log. Informally, the problem setting is as follows: we are given a query log that is partitioned into sessions, sets of queries submitted by the same user. Furthermore, we also have a relational database synopsis with information about the schema of the database (#relations, #attributes, and foreign key constraints) and the range of numerical attributes. Given a new partial session, the objective is to recommend potential query extensions, or aspects.

====3.1 Proposed Work====
To formulate an adequate solution, the following challenges must be addressed:
# '''Aspect Definition.''' There is no clear notion of "item" or aspect in this setting. Instead, we need to find an adequate set of aspects that can be used to capture user intent and characterize queries. Given the exploratory nature of queries in the scientific domains, the aspects should enable both schema navigation and data space exploration.
# '''Sequential Aspects and Domain-Specific Constraints.''' Individual elements in a SQL query are sequential and there is dependency between them. For instance, in <code>SELECT T.A FROM T WHERE X > 10</code>, the domain of variable <code>X</code> is attributes in table <code>T</code>. Thus, given a partial query, only a subset of the aspects are syntactically valid. Queries in the same session are also submitted sequentially.
# '''Session and Aspect Sparsity.''' In SDSS, the typical session has six SQL queries and lasts thirty minutes [21], which indicates aspect sparsity in queries and sessions.

The relational database setting exhibits some similarities to standard recommendation domains (e.g., movie): Some aspects, e.g., tables, attributes, data regions, are popular while the majority of them are unpopular. Some sessions are frequent, i.e., many queries are submitted, while the majority are infrequent. Scalability and responsiveness are important in both domains.

Analogous to our work in Section 2, our main hypothesis is that merely recommending popular aspects is not sufficient for exploratory tasks. Although popular aspects can help familiarize novice users with concepts like the important tables and attributes, given the exploratory nature of queries in IDE, recommendations are deemed more useful if they can help users narrow down their queries and expose relevant data regions. For example, recommending a specific interval like <code>b1 &lt; BRIGHTNESS &lt; b2</code> is more useful than just suggesting the attribute <code>BRIGHTNESS</code>.

Based on these intuitions, we will focus on recommending interesting aspects that enable data exploration and schema navigation for users of a relational database, and in particular, in IDE settings. Using the query log and the database synopsis, we will devise a set of aspects that include not just the relations, attributes, and user-defined functions, but also intervals of numeric attributes, e.g., <code>b1 &lt; BRIGHTNESS &lt; b2</code>. Subsequently, we can use a vector-based query representation model where each element denotes the presence of a certain aspect. Alternatively, a graph-based representation [23] might be more suitable. After formulating similarity measures between queries (or sessions) [2], we can use a nearest neighbour model to suggest relevant aspects to the user.

In contrast to prior work that focuses on supervised learning and query rewriting [7], we focus on aspect definition and extraction. In contrast to [4, 7, 8], we rely on the database synopsis only. Accessing a large scientific database like SDSS to retrieve the entire set of tuples is expensive. In contrast to [14], our recommendations include intervals, not just tables and attributes. The intermediate query format in [18] is complementary to our work.

===4. Facilitating Query Answer Analysis===
After users have successfully submitted a query, their next challenge is to analyse and understand the query answers. When the answer set is small, this task is attainable. The challenge is in examining and interpreting large, or even conflicting, answer sets.

To illustrate the problem, consider again climate data that is reported by various sources and integrated in relational databases. Because the sources were independently created and maintained, a given data point can have multiple, inconsistent values across the sources. For example, one source may have the high temperature for Vancouver on 06/11/2006 as 17C, while another may list it as 19C. As a result of this value-level heterogeneity, an aggregate query such as <code>SELECT AVG(TEMPERATURE)</code> does not have a single true answer. Instead, depending on the choice of data source combinations that are used to answer the query, different answers can be generated. Reporting the entire set of answers can overwhelm the user. Here, mechanisms that summarize the results and help the user understand query answers are required. Therefore, we study the following research question:

:''After a query has been submitted to the system, how can we help the user understand and interpret the query answers?''

Specifically, we address the problem of helping users understand aggregate query answers in integration contexts where data is segmented across several sources. We assume meta-information that describes the mappings and bindings between data sources is available [25]. Our main concern is how to handle the value-level heterogeneity that exists in the data, to enable the user to better understand the range of possible query answers.

====4.1 Solution====
In our solution [30], we represented the answer to the aggregate query as an answer distribution instead of a single scalar value. We then proposed a suite of methods for extracting statistics that convey meaningful information about the query answers. We focused on the following challenges: (1) determining which statistics best represent an answer's distribution, and (2) efficiently computing the desired statistics. In deriving our algorithms, we assumed prior knowledge regarding the sources is unavailable and all sources are equal.

A high coverage interval is one of the statistics we extract to convey the shape of the answer distribution and the intervals where the majority of viable answers can be found. Figure 1 shows the multi-modal answer distributions of the aggregate query <code>AVG(TEMP)</code>, on Canadian climate data (S1) [5] and synthetic data (S4) [30], and their corresponding high coverage intervals.

''Figure 1: High coverage intervals tell where the majority of answers can be found. Panel (a) S1: 2 intervals cover 85.72% of the area, length = 5288.17 (55.528435%). Panel (b) S4: 10 intervals cover 92.12% of the area, length = 1791.83 (22.723944%).''

===5. Summary and Outlook===
The goal of my thesis is to devise techniques that facilitate user interaction with data. I address three aspects:
* (Accomplished) Facilitating data exploration with recommender systems in standard domains (Section 2).
* (In progress) Facilitating data exploration and query composition in the relational database context (Section 3). I am currently working on extracting a dataset, and narrowing down the problem statement.
* (Accomplished) Facilitating query answer analysis by extracting statistics and semantics about the range of query answers (Section 4).

===6. References===
* [1] Gediminas Adomavicius and YoungOk Kwon. Improving aggregate recommendation diversity using ranking-based techniques. TKDE, 24(5):896–911, 2012.
* [2] Julien Aligon, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, and Elisa Turricchia. Similarity measures for olap sessions. Knowledge and information systems, 39(2):463–489, 2014.
* [3] Pablo Castells, Neil J. Hurley, and Saul Vargas. Recommender Systems Handbook, chapter Novelty and Diversity in Recommender Systems. Springer US, 2015.
* [4] Gloria Chatzopoulou, Magdalini Eirinaki, and Neoklis Polyzotis. Query recommendations for interactive database exploration. In International Conference on Scientific and Statistical Database Management, pages 3–18. Springer, 2009.
* [5] Climate Canada. Canada climate data. http://climate.weatheroffice.gc.ca/climateData/canada_e.html, 2010.
* [6] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, 2010.
* [7] Julien Cumin, Jean-Marc Petit, Vasile-Marian Scuturici, and Sabina Surdu. Data exploration with sql using machine learning techniques. In EDBT, 2017.
* [8] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. Explore-by-example: An automatic query steering framework for interactive data exploration. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 517–528. ACM, 2014.
* [9] Simon Dooms, Toon De Pessemier, and Luc Martens. Movietweetings: a movie rating dataset collected from twitter. In CrowdRec at RecSys, 2013.
* [10] Yu-Chieh Ho, Yi-Ting Chiang, and Jane Yung-Jen Hsu. Who likes it more?: mining worth-recommending items from long tails by modeling relative preference. In WSDM, pages 253–262, 2014.
* [11] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263–272. IEEE, 2008.
* [12] HV Jagadish, Adriane Chapman, Aaron Elkiss, Magesh Jayapandian, Yunyao Li, Arnab Nandi, and Cong Yu. Making database systems usable. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 13–24. ACM, 2007.
* [13] Martin L Kersten, Stratos Idreos, Stefan Manegold, Erietta Liarou, et al. The researchers guide to the data deluge: Querying a scientific database in just a few seconds. PVLDB Challenges and Visions, 3, 2011.
* [14] Nodira Khoussainova, YongChul Kwon, Magdalena Balazinska, and Dan Suciu. Snipsuggest: context-aware autocompletion for sql. Proceedings of the VLDB Endowment, 4(1):22–33, 2010.
* [15] Joonseok Lee, Samy Bengio, Seungyeon Kim, Guy Lebanon, and Yoram Singer. Local collaborative ranking. In WWW, pages 85–96, 2014.
* [16] Joonseok Lee, Mingxuan Sun, and Guy Lebanon. A comparative study of collaborative filtering algorithms. arXiv preprint arXiv:1205.3193, 2012.
* [17] Andriy Mnih and Ruslan Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2007.
* [18] Hoang Vu Nguyen, Klemens Böhm, Florian Becker, Bertrand Goldman, Georg Hinkel, and Emmanuel Müller. Identifying user interests within the data space-a case study with skyserver. In EDBT, pages 641–652, 2015.
* [19] M Jordan Raddick, Ani R Thakar, Alexander S Szalay, and Rafael DC Santos. Ten years of skyserver i: Tracking web and sql e-science usage. Computing in Science & Engineering, 16(4):22–31, 2014.
* [20] Information removed for double-blind review. Submitted paper, 2017.
* [21] Vik Singh, Jim Gray, Ani Thakar, Alexander S Szalay, Jordan Raddick, Bill Boroski, Svetlana Lebedeva, and Brian Yanny. Skyserver traffic report-the first five years. arXiv preprint cs/0701173, 2007.
* [22] Saúl Vargas and Pablo Castells. Improving sales diversity by recommending users to items. In RecSys, 2014.
* [23] Roy Villafane, Kien A Hua, Duc Tran, and Basab Maulik. Mining interval time series. In International Conference on Data Warehousing and Knowledge Discovery, pages 318–330. Springer, 1999.
* [24] Markus Weimer, Alexandros Karatzoglou, Quoc Viet Le, and Alex Smola. Maximum margin matrix factorization for collaborative ranking. Advances in neural information processing systems, pages 1–8, 2007.
* [25] Jian Xu and Rachel Pottinger. Integrating domain heterogeneous data sources using decomposition aggregation queries. Information Systems, 39(0), 2014.
* [26] Hongzhi Yin, Bin Cui, Jing Li, Junjie Yao, and Chen Chen. Challenging the long tail recommendation. PVLDB, 5(9):896–907, 2012.
* [27] Weinan Zhang, Jun Wang, Bowei Chen, and Xiaoxue Zhao. To personalize or not: a risk management perspective. In RecSys, pages 229–236, 2013.
* [28] Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin. A fast parallel sgd for matrix factorization in shared memory systems. In RecSys, pages 249–256, 2013.
* [29] Sedigheh Zolaktaf and Gail C Murphy. What to learn next: recommending commands in a feature-rich environment. In ICMLA, pages 1038–1044. IEEE, 2015.
* [30] Zainab Zolaktaf, Jian Xu, and Rachel Pottinger. Extracting aggregate answer statistics for integration. EDBT, 2015.