Interactive Preference Elicitation for Scientific and Cultural Recommendations

Eduardo Veas (1,2) and Cecilia di Sciascio (2)
1 Information and Communications Technologies, National University of Cuyo
2 Knowledge Visualization, Know-Center GmbH
eduveas@gmail.com, cdissciascio@know-center.at

Abstract

This paper presents a visual interface developed on the basis of control and transparency to elicit preferences in the scientific and cultural domain. Preference elicitation is a recognized challenge in user modeling for personalized recommender systems. The amount of feedback the user is willing to provide depends on how trustworthy the system seems to be and how invasive the elicitation process is. Our approach ranks a collection of items with a controllable text analytics model. It integrates control with the ranking and uses it as implicit preference for content-based recommendations.

1 Introduction

A recommender system (RS) depends on a model of a user to be accurate. To build a model of the user, behavioral recommenders collect preferences from browsing and purchasing history, whereas rating recommenders require a user to rate a set of items to state their preferences (implicit and explicit methods, respectively) [Pu et al., 2011]. Preference elicitation is fundamental for the whole operational lifecycle of a RS: it affects the recommendations for a new user and also those of the whole system community, given what the RS learns from each new user [Cremonesi et al., 2012]. Whichever method is chosen, preference elicitation represents an added effort, which may be willingly avoided to the detriment of user satisfaction. The amount of feedback the user is willing to provide is a tradeoff between system aspects and personal characteristics, for example privacy vs. recommendation quality [Knijnenburg et al., 2012].

In their seminal work, Swearingen and Sinha pointed out one challenge: the recommender has to convince the user to try the recommended items [Swearingen and Sinha, 2001]. To do so, the recommendation algorithm has to propose items effectively, but the interface must also deliver recommendations in a way that can be compared and explained [Ricci et al., 2011]. The willingness to provide feedback is directly related to the overall perception and satisfaction the user has of the RS [Knijnenburg et al., 2012]. Explanation interfaces increase confidence in the system (trust) by explaining how the system works (transparency) [Tintarev and Masthoff, 2012] and by allowing users to tell the system when it is wrong (scrutability) [Kay, 2006]. Hence, to warrant increased user involvement the RS has to justify recommendations and let the user customize their generation. Transparency and controllability are key facilities of a self-explanatory RS that promote trust and satisfaction [Tintarev and Masthoff, 2012].

Our work is set in the scientific and cultural domain. In this frame, users are most often engaged in exploration and production tasks that involve gathering and organizing large collections in preparatory steps (e.g., for writing, or preparing a lecture or presentation). A federated system (FS) compiles scientific documents or electronic cultural content (images) upon an explicit or implicit query, with little control over the way results are generated. Content takes the form of text document surrogates comprising title and abstract. They also include minimal additional metadata, such as creator, URL, provider and year of publication.

This paper introduces a visual tool to support exploration of scientific and cultural collections. The approach includes a metaphor to represent a set of documents, with which the user interacts to understand and define themes of interest. The contribution of this work is the interactive personalization feature that, instead of presenting a static ranked list, allows users to dynamically re-sort the document set in the visual representation and re-calculate relevance scores with regard to their own interests. The visual interface employs controllable methods and represents their results in a transparent manner which, rather than adding effort, reduces the complexity of the overall task.
2 The Approach

The proposed approach was designed to quickly reorganize a large collection in terms of its relevance to a set of keywords expressing the choice of topic. In a nutshell, the goal is to interactively discover the topics in a collection, building up the user's knowledge along the way. But, instead of trying to infer a hidden topic structure fully automatically (as in [Blei, 2012]), we propose an interactive approach, which works as a conversation between the user and the RS to build a personalized theme structure. Controllability and transparency are crucial for the user to understand how a topic came about from their personal exploration. The challenge for the interface is to clearly explain the recommendation process, and for the analytics method to reduce the computational problem to interactive terms.

Figure 1: (Left) TagBox: a summary of the collection contents as a bag of words. (Right) The RankView is updated after two terms have been selected. As the user points at a third keyword (employment), a hint shows which documents would be affected by picking it (3 highlighted documents).

2.1 Visual Interface

To search and explore documents based on the themes that run through them, we build an interface that allows the user to establish a conversation with the RS. The interface comprises two main parts: the topic summary and the recommendation pane. The topic summary is built from keywords extracted from the whole collection. Keywords are presented in a TagBox, organized and encoded in terms of their frequency of occurrence in the collection (tf-idf), see Fig. 1. The recommendation list initially shows the unranked collection. As the user interacts with the contents, choosing words to express her information needs, the recommendation list is ranked on-the-fly (see Fig. 1). The RankView shows the contribution each keyword has on the overall score of a document. With a slider, the user can assign a weight to a keyword and modify its contribution to the score. Furthermore, the TagBox and RankView illustrate the possible effect of user actions in a quick overview: mousing over a keyword in the TagBox shows a micro-chart with the proportion of documents affected, and the RankView highlights those documents in view that would be affected by choosing the keyword.

It is important to note that the user is aware and in control of the ranking and organization of the document collection at all times. With the visual interface, the user describes her information needs and chooses documents from the collection that best reflect those needs. Chosen items can be assigned to a collection. The act of choosing an item is considered an expression of preference. With the collection, the system stores the keywords and score of each document. Although this feedback is not yet incorporated in our ranking approach, we analyze its effects with a user study and outline future directions to integrate this additional information in the system.

2.2 Text Analytics and Ranking

Keyword extraction plays two roles: it summarizes the topics in the collection, and it also provides the basis for the fast ranking of documents. Preprocessing involves part-of-speech tagging, singularizing plural nouns, and stemming with a Porter stemmer. The resulting terms form a document vector, which also constitutes its index. Subsequently, individual terms are scored with TF-IDF (term frequency - inverse document frequency). It reflects how important a term is to a document in a collection, computed from the term's frequency in the document and the logarithm of the inverse fraction of documents in which it appears. The more frequent a term is in a document and the fewer times it appears in the corpus, the higher its score will be. TF-IDF-scored terms are added to the metadata of each document. To provide an overview of the contents in the collection, keywords from all documents are collected in a global set of keywords. Global keywords are sorted by the accumulated document frequency (DF), calculated as the number of documents in which a keyword appears, regardless of the frequency within the documents.
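To make this step concrete, here is a minimal sketch of keyword extraction with per-document TF-IDF scores and the global document frequency used to order the TagBox. It assumes a plain Python pipeline with NLTK's Porter stemmer; the function names and the raw-frequency TF normalization are our own simplifications, and the part-of-speech tagging and singularization mentioned above are omitted for brevity.

```python
import math
import re
from collections import Counter

from nltk.stem import PorterStemmer  # Porter stemming, as in the paper's preprocessing

stemmer = PorterStemmer()

def tokenize(text):
    # Lowercase, split on non-letters, and stem. The actual pipeline also applies
    # part-of-speech tagging and singularization of plural nouns (omitted here).
    return [stemmer.stem(token) for token in re.findall(r"[a-z]+", text.lower())]

def tfidf_keywords(documents):
    """Per-document TF-IDF scores plus the global document frequency (DF)."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # DF: number of documents a term appears in, regardless of within-document frequency.
    df = Counter(term for counts in term_counts for term in counts)
    scores = []
    for counts in term_counts:
        total = sum(counts.values())
        scores.append({term: (count / total) * math.log(n_docs / df[term])
                       for term, count in counts.items()})
    return scores, df

# Global keyword summary sorted by accumulated DF, as in the TagBox.
docs = ["Robots and employment in the workforce ...",
        "Augmented reality for cultural collections ..."]
doc_scores, global_df = tfidf_keywords(docs)
tagbox_keywords = [term for term, _ in global_df.most_common(20)]
```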
Quick exploration of content depends on quickly re-sorting the documents according to the information needs of the user, expressed with a query built from a subset of the global keyword collection. We assume that some keywords are more important to the topic model than others and allow the user to assign weights to them.

The documents in the set are then ranked and sorted as follows. Given a set of documents $D = \{d_1, \ldots, d_n\}$, a set of keywords $K = \{k_1, \ldots, k_m\}$ and a set of selected keywords $T = \{t_1, \ldots, t_p\}$, $T \subseteq K$, the overall score for document $d_i$ is calculated as the sum of the weighted scores of its keywords matching selected keywords:

$$ s_{d_i} = \sum_{j=1}^{p} w_{t_j} \times m_{d_i t_j}, $$

where $w_{t_j}$ is the weight assigned by the user to the selected keyword $t_j$, such that $\forall j : 0 \leq w_{t_j} \leq 1$, and $m_{d_i t_j}$ is the tf-idf score for keyword $t_j$ in document $d_i$. $D$ is next sorted by overall score using the quicksort algorithm. Documents in $D$ are now elements of a sequence $Q$ with order determined by:

$$ Q = (q_i)_{i=1}^{n}, \quad q_i, q_{i+1} \in D \;\wedge\; s_{q_i} \geq s_{q_{i+1}}. $$

Finally, the ranking position is calculated in such a way that items with equivalent overall score share the same position. The position for a sorted document $q_i$ is calculated as

$$ r_{q_i} = \begin{cases} 1 & \text{if } i = 1 \\ r_{q_{i-1}} & \text{if } s_{q_i} = s_{q_{i-1}} \\ r_{q_{i-1}} + |C| & \text{if } s_{q_i} < s_{q_{i-1}} \end{cases} $$

where $C = \{\, q_j \mid s_{q_j} = s_{q_{i-1}},\ 1 \leq j < i \,\}$ represents the set of all items sharing the immediately superior overall score to $q_i$.

The current approach employs a term-frequency-based scheme to compute document scores, as it is more appropriate for computing and highlighting individual term contributions than a single similarity measure.
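The ranking and tie-aware positions above can be sketched as follows, reusing the per-document TF-IDF maps from the previous sketch. The function name and data layout are ours; Python's built-in sort stands in for the quicksort mentioned in the text, and the position rule implements the competition-style ranking the equations describe.

```python
def rank_documents(doc_keyword_scores, selected_weights):
    """Weighted keyword scores with shared positions for tied documents.

    doc_keyword_scores: list of {keyword: tf-idf score} dicts, one per document.
    selected_weights:   {keyword: weight in [0, 1]} set by the user via sliders.
    """
    # Overall score s_d: sum over selected keywords of user weight * tf-idf score.
    overall = [sum(weight * scores.get(keyword, 0.0)
                   for keyword, weight in selected_weights.items())
               for scores in doc_keyword_scores]

    # Sort document indices by descending overall score (the sequence Q).
    order = sorted(range(len(overall)), key=lambda i: overall[i], reverse=True)

    # Rank positions: ties share a position; when the score drops, the position jumps
    # by the size of the preceding tie block, i.e. r = r_prev + |C| in the paper's notation.
    positions = {}
    for idx, doc in enumerate(order):
        if idx > 0 and overall[doc] == overall[order[idx - 1]]:
            positions[doc] = positions[order[idx - 1]]
        else:
            positions[doc] = idx + 1
    return order, positions

# Example: re-rank with two weighted keywords chosen in the TagBox.
# order, positions = rank_documents(doc_scores, {"robot": 1.0, "employ": 0.5})
```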
3 Experimental Setup

We performed a preliminary study to determine whether controllability and transparency increase complexity and pose an extra effort in the task of building topic-oriented document collections. Participants were thus tasked to "gather relevant items" using our tool (U) or using a recommendation list (L) with the usual tools (keyword search). We chose two variations of dataset size in terms of item count: S(30) and L(60).

3.1 Method

The study was structured as a repeated measures design, with four iterations of the same tasks, each with a different combination of the independent variables (e.g., US-LL-UL-LS). To counter the effects of habituation, we used four topics covering a spectrum of cultural, technical and scientific content: women in the workforce (WW), robotics (RO), augmented reality (AR), and circular economy (CE). Each of these topics has a well-defined Wikipedia page, which was used as a seed to retrieve a collection from a federated system. The system creates a query from the text of the page and forwards it to a number of content providers. The result is a joint list of items from each provider. The federated system cannot establish how relevant the items are. Furthermore, the resulting collection refers to the whole text, but there is no indication of subtopics. We collected sets of 60 and 30 items as static datasets for each topic. We simulated the proposed scenario of reorganizing the collection by choosing subtopics for each task in the study. The combinations were randomized and assigned using a balanced Latin square.

Each condition had two fundamental tasks: find items most relevant to a set of given keywords, and find items most relevant to a short text. In the former, participants were given the keywords and they just had to explore the collection. There were two iterations of this task for each condition. The short-text task required participants to come up with the keywords describing the topic by themselves.

Twenty-four (24) participants took part in the study (11 female, 13 male, between 22 and 37 years old). They were recruited from the graduate populations of the medical and computer science universities. None of them was majoring in the topic areas selected for the study.

Procedure
A study session started with an intro video, which explained the functionality of the tool. Each participant got exactly the same instructions. There was a short training session on a dummy dataset to let participants familiarize themselves with the tool. Thereafter, the first condition started. The system showed a short text to introduce the topic. After reading the text, participants pressed start, opening the interface for the first task. At the beginning of the task, the items in the collection were ordered randomly, ensuring that an item would not appear in the same position again. The instructions for the task were shown in the upper part of the screen. In all conditions participants were able to collect items and inspect their collections. In the (L) condition the main interface was a list of items, whereas the (U) condition used the proposed interface. Participants had to click the finished button to conclude the task. It was possible to finish without collecting all items. After each condition, participants had to fill in a NASA TLX questionnaire to assess cognitive load, performance and effort, among others. The procedure was repeated for each of the four iterations. Thereafter participants were interviewed for comments.

3.2 Results

NASA TLX data were analyzed using a repeated measures ANOVA with independent variables tool and dataset size. Post-hoc effects were computed using Bonferroni-corrected pairwise comparisons. The two-by-two experimental design ensures that sphericity is necessarily met. A repeated measures ANOVA revealed a significant effect of tool on perceived workload, F(1,23) = 35.254, p < 0.01, ϵ = 0.18. A post-hoc paired-samples t-test revealed a significantly lower workload when using uRank (p < 0.001). Further, repeated measures ANOVAs on each dimension of the workload measure showed significant effects of tool in all dimensions, as shown in Table 1.

Table 1: Complexity: participants found that our tool incurs a significantly lower workload in all dimensions.

Dimension         F(1,23)   p          ϵ
Mental Demand     19.700    p < 0.05   0.10
Physical Demand   14.520    p < 0.01   0.07
Temporal Demand    7.720    p < 0.05   0.05
Performance       11.800    p < 0.01   0.10
Effort            48.600    p < 0.001  0.22
Frustration       15.120    p < 0.01   0.07
Workload          35.254    p < 0.01   0.20
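For illustration, an analysis of this kind could be run along the following lines with statsmodels; the file name, column names and condition labels are assumptions about the data layout, not the authors' actual setup.

```python
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

# Long-format NASA TLX data: one row per participant, tool and dataset size
# (assumed columns: "participant", "tool" in {"U", "L"}, "size" in {"S", "L"}, "workload").
tlx = pd.read_csv("nasa_tlx.csv")

# Two-way repeated measures ANOVA (tool x dataset size) on perceived workload.
anova = AnovaRM(tlx, depvar="workload", subject="participant",
                within=["tool", "size"], aggregate_func="mean").fit()
print(anova)

# Post-hoc paired-samples t-test between the two tools; for a Bonferroni
# correction, multiply p by the number of pairwise comparisons performed.
wide = tlx.groupby(["participant", "tool"])["workload"].mean().unstack()
t_stat, p_value = ttest_rel(wide["U"], wide["L"])
print(f"U vs. L: t = {t_stat:.3f}, p = {p_value:.4f}")
```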
To test the proposed recommender, we gathered and compared, for each topic (WW, RO, AR, CE), the most popular items collected using the list (L-MP) and using our approach (U-MP) with the scores received by our ranking algorithm (U).

Figure 2: Correlation heatmap. Most popular items collected with our tool (U-MP) had high scores in the topic-based ranking (U). Most popular items collected with the list (L-MP) are more widespread. The ranking (U) produced many high-scoring items (WW-q1, RO-q1, RO-q2), indicating that personalized ranking may be more appropriate.

Table 2: Correlation analysis: ICCs established good to excellent correlations between the most popular items collected with the list (L-MP), with our tool (U-MP), and the ranking scores.

Task  ww-q1  ww-q2  ww-q3  ro-q1  ro-q2  ro-q3
ICC   .658   .862   .751   .857   .835   .738
Task  ar-q1  ar-q2  ar-q3  ce-q1  ce-q2  ce-q3
ICC   .814   .869   .813   .911   .866   .707

We performed intra-class correlations (ICC), using a two-way, consistency, average-measures model. Results are summarized in Table 2. For broad exploration (q1 & q2), we found good to excellent ICCs. A closer look at the distribution of scores in Fig. 2 underlines the fact that highly ranked documents (U) were a popular choice with U-MP and also relatively popular with L-MP. For q3, the ranking (U) produced widespread scores with fewer individual favorites; items in L-MP were generally least popular. U-MP resulted in the most focused of the three (fewer blocks with higher intensity).
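The two-way, consistency, average-measures model referred to above corresponds to ICC(C,k) in McGraw and Wong's terminology (ICC(3,k) in Shrout and Fleiss). A minimal sketch computing it from the two-way ANOVA mean squares is shown below; the input layout is an assumption, since the paper does not state how the coefficients were obtained.

```python
import numpy as np

def icc_consistency_average(ratings):
    """Two-way, consistency, average-measures ICC (McGraw & Wong ICC(C,k)).

    ratings: n_items x k_measures array; rows are items, columns are the
    profiles being compared (e.g. L-MP, U-MP and the ranking scores U).
    """
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    # Sums of squares from the two-way (items x measures) ANOVA decomposition.
    ss_items = k * np.sum((y.mean(axis=1) - grand) ** 2)
    ss_measures = n * np.sum((y.mean(axis=0) - grand) ** 2)
    ss_error = np.sum((y - grand) ** 2) - ss_items - ss_measures
    ms_items = ss_items / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_items - ms_error) / ms_items

# Example: icc_consistency_average(np.column_stack([l_mp, u_mp, u_scores]))
```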
4 Discussion and Outlook

Results show that the fast-ranking method in our content recommender helps users quickly reorganize collections. The preference elicitation method was well received and quickly adopted. Participants experienced less effort and overall workload using our tool. Still, they took time to check their choices carefully in both the U and L conditions. Comparing the most popular choices after the experiment reinforces our assumptions: the fast-ranking method (U) correlates with the most popular choices made with the tool (U-MP) but also without it (L-MP). Yet, widespread results in some cases call for a personalized recommendation method. Our preference elicitation forms the backbone of personalized recommendations. In the future we will explore recommendations of related items in context, showing the keywords used to collect an item, other items collected together with it, and the collections it belongs to.

Acknowledgments

This work is partially funded by CONICET (project Visual-I-Lab, Res. 4050-13) and by Know-Center. Know-Center is funded by the Austrian Research Promotion Agency (FFG) under the COMET Program.

References

[Blei, 2012] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

[Cremonesi et al., 2012] Paolo Cremonesi, Franca Garzotto, and Roberto Turrin. User effort vs. accuracy in rating-based elicitation. In Proc. of the Sixth ACM Conf. on Recommender Systems, RecSys '12, pages 27–34, New York, NY, USA, 2012. ACM.

[Kay, 2006] Judy Kay. Scrutable adaptation: Because we can and must. In Vincent P. Wade, Helen Ashman, and Barry Smyth, editors, AH, volume 4018 of Lecture Notes in Computer Science, pages 11–19. Springer, 2006.

[Knijnenburg et al., 2012] Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, October 2012.

[Pu et al., 2011] Pearl Pu, Li Chen, and Rong Hu. A user-centric evaluation framework for recommender systems. In Proc. of the Fifth ACM Conf. on Recommender Systems, RecSys '11, pages 157–164, New York, NY, USA, 2011. ACM.

[Ricci et al., 2011] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. Introduction to Recommender Systems Handbook, pages 1–35. Springer US, 2011.

[Swearingen and Sinha, 2001] K. Swearingen and R. Sinha. Beyond algorithms: An HCI perspective on recommender systems. In ACM SIGIR Workshop on Recommender Systems, volume 13, numbers 5-6, pages 393–408, 2001.

[Tintarev and Masthoff, 2012] Nava Tintarev and Judith Masthoff. Evaluating the effectiveness of explanations for recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):399–439, October 2012.